NER for Europeana Newspapers
Clemens Neudecker (@cneudecker)
Staatsbibliothek zu Berlin –
Preußischer Kulturbesitz
Background
Why Named Entity Recognition?
• Analysis* of query log files from the National Library of Wales
newspaper website: the vast majority of search queries contain
either person or place names
* Paul Gooding, "Exploring Usage of Digital Newspaper Archives through Web Log Analysis:
A Case Study of Welsh Newspapers Online", presented at DH2014, Lausanne
• Improving Information
Retrieval
• Linking to authority files
(Linked Data)
• Historical Social Network
Analysis (HNA/SNA)
Languages
• Dutch (1614 – 1900)
• French (1814 – 1944)
• German (1721 – 1949)
• Together approx. 50% of the total collection
Many challenges
• Historical data (language)
• Noisy data (OCR)
• Multilingual data
• Lack of extensive metadata
• Lack of open resources
(tagged corpora, gazetteers)
• Lack of common annotation guidelines
• Limitations of annotation tools
Technology
Reuse of existing NER tools
• Simple evaluation of
– Apache OpenNLP
– Stanford CoreNLP
– GATE
• Reasons for choosing Stanford CoreNLP:
– Java-based (thread safe, scalable)
– Good performance (F-measure)
– Strong and active community
– Rather robust against noisy input (CRF)
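The F-measure used to compare the tools can be illustrated with a minimal sketch. It assumes entities are compared as exact (text, type) pairs, which is only one possible matching scheme; the slides do not say which was used. The function name and the sample entities are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted entities against gold entities by exact match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical sample data: one entity gets the wrong type.
gold = {("Reynolds", "PER"), ("Baltimore", "LOC"), ("Berlin", "LOC")}
predicted = {("Reynolds", "PER"), ("Baltimore", "ORG"), ("Berlin", "LOC")}
p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

An entity with the right text but the wrong type counts as both a false positive and a false negative here, so it lowers precision and recall alike.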
Approach
• Adaptation of Stanford CoreNLP by the
KB National Library of the Netherlands
to directly consume ENMAP (= Europeana
Newspapers METS/ALTO profile) objects
Approach
• Export option ALTO v3 with tags added
<String STYLEREFS="ID7" HEIGHT="132.0" WIDTH="570.0" HPOS="5937.0"
VPOS="3279.0" CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5">
</String>
<String STYLEREFS="ID7" HEIGHT="102.0" WIDTH="540.0" HPOS="18438.0"
VPOS="22008.0" CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10">
</String>
…
<Tags>
<NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
<NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
</Tags>
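A minimal sketch of how a consumer might read such an export: it resolves each TAGREFS value against the NamedEntityTag definitions. The wrapper element, the untagged word "visited" and the function name are assumptions made for illustration. A real ALTO v3 file also declares an XML namespace and nests String elements more deeply, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Simplified fragment modelled on the slide; <alto> wrapper and the
# untagged word are hypothetical, and no namespace is declared.
ALTO_SNIPPET = """
<alto>
  <Tags>
    <NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
    <NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
  </Tags>
  <String CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5"/>
  <String CONTENT="visited" WC="0.9"/>
  <String CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10"/>
</alto>
"""


def tagged_strings(xml_text):
    """Return (content, entity type, word confidence) for tagged words."""
    root = ET.fromstring(xml_text)
    # Index tag definitions by ID so that TAGREFS can be resolved.
    tags = {t.get("ID"): t for t in root.iter("NamedEntityTag")}
    results = []
    for s in root.iter("String"):
        # TAGREFS may hold several space-separated IDs.
        for ref in (s.get("TAGREFS") or "").split():
            tag = tags.get(ref)
            if tag is not None:
                confidence = float(s.get("WC", "nan"))
                results.append((s.get("CONTENT"), tag.get("TYPE"), confidence))
    return results


entities = tagged_strings(ALTO_SNIPPET)
for content, entity_type, confidence in entities:
    print(content, entity_type, confidence)
```

Keeping the OCR word confidence (WC) next to each entity makes it possible to filter out entities recognised on poorly read words.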
Annotation
• Quick evaluation of annotation tools:
– BRAT
– WebAnno
– INL Attestation Tool
• Reasons for choosing the INL Attestation Tool:
– Optimized for tagging speed
– Supported by consortium partner (INL/IVDNT)
Corpus creation
• Selection of 100 pages per language
• Processing of the OCRed texts with
StanfordNER to get initial tagging results
• Manual verification and annotation
Corpus statistics

Entity counts:
Language   # tokens   # PER   # LOC   # ORG
French      207,000   5,672   5,614   2,574
Dutch       182,483   4,492   4,448   1,160
German       96,735   7,914   6,143   2,784

As a share of all tokens:
Language   Tokens   PER     LOC     ORG
French     100%     2.75%   2.71%   1.24%
Dutch      100%     2.46%   2.44%   0.64%
German     100%     8.18%   6.35%   2.88%

Language   Word-Error-Rate (Bag of Words)   Reading Order Success Rate
French     16.6%                            19.9%
Dutch      17.6%                            23.2%
German     15.9% / 21.9%                    13.6%
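The percentage table can be recomputed from the counts with a short sketch. The dictionary layout and the function name are assumptions. The French token count looks rounded on the slide, so a recomputed value may differ from the slide in the last digit.

```python
# Counts copied from the corpus statistics slide.
CORPUS = {
    "French": {"tokens": 207_000, "PER": 5_672, "LOC": 5_614, "ORG": 2_574},
    "Dutch": {"tokens": 182_483, "PER": 4_492, "LOC": 4_448, "ORG": 1_160},
    "German": {"tokens": 96_735, "PER": 7_914, "LOC": 6_143, "ORG": 2_784},
}


def entity_density(counts):
    """Return each entity type's share of all tokens, in percent."""
    total = counts["tokens"]
    return {
        label: round(100 * count / total, 2)
        for label, count in counts.items()
        if label != "tokens"
    }


for language, counts in CORPUS.items():
    print(language, entity_density(counts))
```

The German sample is less than half the size of the French one but contains more person and location mentions, so its entity density is roughly three times higher.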
ner-app
https://github.com/EuropeanaNewspapers/ner-app
ner-corpora
https://github.com/EuropeanaNewspapers/ner-corpora
Evaluation: NL
Evaluation: FR
Evaluation: DE
• A Named Entity Recognition Shootout for
German
M. Riedl and S. Padó. Proceedings of ACL,
Melbourne, Australia (2018). To appear.
NER vs OCR success rate
[Chart: NER and OCR success rates plotted on a shared axis from 0.25 to 0.95]
Future Plans
Improving performance
• Possible additional features
– Distributional similarity (Clark 2003)
– Semantic generalization (Faruqui & Padó 2010)
– Word embeddings (Braune 2017)
• Gazetteers
– Person names, historical place names
• Data cleanup and improvement
– https://github.com/EuropeanaNewspapers/ner-corpora/wiki
Trias NER
• Combination and voting of different NER
classifiers, e.g.
– Stanford CoreNLP
– spaCy
– NLTK
• Inspiration:
https://github.com/KBNLresearch/Trias_NER
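The voting idea can be sketched as a per-token majority vote. This is an illustration under assumptions, not the Trias_NER implementation: it assumes every classifier labels the same token sequence, and the function name, label set and sample outputs are hypothetical.

```python
from collections import Counter


def vote(label_sequences, fallback="O"):
    """Majority vote per token over several taggers' label sequences."""
    lengths = {len(seq) for seq in label_sequences}
    if len(lengths) != 1:
        raise ValueError("all taggers must label the same tokens")
    voted = []
    for labels in zip(*label_sequences):
        (top, count), = Counter(labels).most_common(1)
        # Require a strict majority; otherwise fall back to "no entity".
        voted.append(top if count > len(labels) / 2 else fallback)
    return voted


# Hypothetical outputs of three taggers for the same three tokens.
tokens = ["Reynolds", "visited", "Baltimore"]
tagger_a = ["PER", "O", "LOC"]
tagger_b = ["PER", "O", "ORG"]
tagger_c = ["ORG", "O", "LOC"]
combined = vote([tagger_a, tagger_b, tagger_c])
print(list(zip(tokens, combined)))
```

Falling back to "O" when no label wins a strict majority favours precision over recall; a weighted vote based on each tagger's measured accuracy would be an alternative.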
Disambiguation
• Disambiguation of person and place names
• Inspiration:
https://github.com/KBNLresearch/europeananp-dbpedia-disambiguation
Linking
• Linking of recognised and disambiguated named entities
to authority files (e.g. Wikidata, GND)
• Inspiration:
https://github.com/KBNLresearch/dac
