An Open Corpus for Named Entity Recognition in Historic Newspapers

Clemens Neudecker
Berlin State Library
@cneudecker
LREC2016, 23-28 May 2016, Portorož, Slovenia

Background
• Europeana Newspapers EU-project:
www.europeana-newspapers.eu
• OCRed 12m pages of historic newspapers
from Europe (an estimated 25 billion words!)
• Newspaper content from 23 libraries, in 40
languages, covering 4 centuries (1618-1990)
• Public domain full-text available for download
per language/content provider

Formats & Standards
• Full-text produced in ALTO
• Metadata (structural) in METS
• Metadata (bibliographic) in EDM
• Not a fan of XML?
Good ol' plain text (UTF-8) is also available…
research.europeana.eu/itemtype/newspapers
• Currently working on:
– API for text/search
– API for images (IIIF)

Approach
• 3 languages selected for NER:
Dutch, German, French – in collab. with
• Content in these languages constitutes about
50% of the overall full-text in the collection

Methodology
• Select 100 representative pages per language
– If a classifier already exists for given language –
run it on the selected 100 pages
– Ingest tagged/untagged pages to annotation tool
– Manually add/correct annotations
(>=2 librarians per language)
– Export and convert tagged data to BIO format
– Train classifier from BIO & gazetteers (if available)
– Evaluate derived classifier using 4-fold cross-eval
– Repeat until classification performance converges
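The export step in the list above can be illustrated with a minimal Python sketch. It assumes annotations are available as token-index spans of the form (start, end-exclusive, type); the function name `to_bio`, the input shape and the example sentence are hypothetical and not taken from the project's actual converter. In BIO encoding the first token of an entity is tagged `B-<type>`, any following tokens of the same entity `I-<type>`, and all other tokens `O`.

```python
from typing import List, Tuple


def to_bio(tokens: List[str],
           spans: List[Tuple[int, int, str]]) -> List[Tuple[str, str]]:
    """Pair each token with a BIO label.

    spans holds (start, end_exclusive, entity_type) token indices and is
    assumed not to overlap.
    """
    labels = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        labels[start] = f"B-{entity_type}"        # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{entity_type}"        # continuation tokens
    return list(zip(tokens, labels))


# Made-up example sentence, for illustration only.
tokens = ["John", "Reynolds", "arrived", "in", "Baltimore", "."]
spans = [(0, 2, "PER"), (4, 5, "LOC")]

for token, label in to_bio(tokens, spans):
    print(f"{token}\t{label}")   # one token and label per line, tab-separated
```

One token per line with a tab-separated label is a layout that CRF toolkits commonly accept as training input; the exact column layout the project used is not stated in the slides.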

NER software
• Tested Stanford NER, OpenNLP, NLTK, Gate
• Adaptation of Stanford NER package (CRF)
– Mature, well-documented, widely used
– Open source (GPL)
– Thread-safe & platform-independent (JVM)
– Machine learning scales out more easily
to multiple languages
– Prior experience working with CRF

NER encoding in ALTO
• In ALTO versions >2.1, this is possible:
<String STYLEREFS="ID7" HEIGHT="132.0" WIDTH="570.0" HPOS="5937.0"
VPOS="3279.0" CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5">
</String>
<String STYLEREFS="ID7" HEIGHT="102.0" WIDTH="540.0" HPOS="18438.0"
VPOS="22008.0" CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10">
</String>
…
<Tags>
<NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
<NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
</Tags>
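To show how the TAGREFS mechanism can be read back, here is a minimal Python sketch using only the standard library. The embedded XML is a simplified stand-in for an ALTO file: real files carry a namespace, many more attributes and a deeper layout hierarchy, and the untagged "of" token as well as the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free stand-in for an ALTO file (assumption).
ALTO_SNIPPET = """<alto>
  <Tags>
    <NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
    <NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
  </Tags>
  <Layout>
    <String CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5"/>
    <String CONTENT="of" WC="0.9"/>
    <String CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10"/>
  </Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def tagged_strings(xml_text: str):
    """Return (content, entity_type or None) for every String element."""
    root = ET.fromstring(xml_text)
    # Map tag IDs to entity types, e.g. {"Tag5": "Person"}.
    tag_types = {
        el.get("ID"): el.get("TYPE")
        for el in root.iter()
        if local_name(el.tag) == "NamedEntityTag"
    }
    result = []
    for el in root.iter():
        if local_name(el.tag) != "String":
            continue
        refs = (el.get("TAGREFS") or "").split()   # TAGREFS may list several IDs
        types = [tag_types[r] for r in refs if r in tag_types]
        result.append((el.get("CONTENT"), types[0] if types else None))
    return result


for content, entity_type in tagged_strings(ALTO_SNIPPET):
    print(content, entity_type)
```

Matching on the local element name keeps the sketch independent of the namespace URI, which differs between ALTO versions.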

Annotation
• Evaluated BRAT, WebAnno, INL Attestation
• Reasons for selection of INL Attestation:
– Speed
– Support of ALTO format
– Support from INL available
Annotation stats
Entity counts per language
Language   # tokens   # PER   # LOC   # ORG
French      207,000   5,672   5,614   2,574
Dutch       182,483   4,492   4,448   1,160
German       96,735   7,914   6,143   2,784

Entities as a share of all tokens
Language   Tokens   PER     LOC     ORG
French     100%     2.75%   2.71%   1.24%
Dutch      100%     2.46%   2.44%   0.64%
German     100%     8.18%   6.35%   2.88%

OCR quality
Language   Word-Error-Rate (Bag of Words)   Reading Order Success Rate
French     16.6%                            19.9%
Dutch      17.6%                            23.2%
German     15.9% / 21.9%                    13.6%
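The percentage figures follow from the raw counts by dividing each entity count by the token count. A short Python sketch of that arithmetic, using only the numbers from the slides (the dictionary layout and the function name `entity_share` are illustrative choices):

```python
# Token and entity counts as given in the annotation statistics.
COUNTS = {
    "French": {"tokens": 207_000, "PER": 5_672, "LOC": 5_614, "ORG": 2_574},
    "Dutch":  {"tokens": 182_483, "PER": 4_492, "LOC": 4_448, "ORG": 1_160},
    "German": {"tokens": 96_735,  "PER": 7_914, "LOC": 6_143, "ORG": 2_784},
}


def entity_share(row: dict) -> dict:
    """Return each entity count as a percentage of the token count."""
    return {
        label: round(100 * count / row["tokens"], 2)
        for label, count in row.items()
        if label != "tokens"
    }


for language, row in COUNTS.items():
    print(language, entity_share(row))
```

Recomputing this way reproduces the published percentages to within 0.01 percentage points; the French token count looks rounded, which may explain a small difference in the last digit for French PER. The German sample is less than half the size of the French one yet has roughly three times the entity density.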

Challenges
• Clear, comprehensive & common guidelines
for manual annotation
• OCR quality – on average 80% word accuracy
• Wide variation in historical spelling
• Mix of languages on a single page
• Lack/loss of metadata on page/word level
• Some data corruption occurred when ingesting
pre-tagged data into the annotation tool

Attempted workarounds
• Introduce OCR error patterns into training
data
→ actually yields lower precision/recall
• Introduce a spelling variation module in the
NER classifier
→ rewrite rules (e.g. "frorn" → "from")
→ high integration effort
→ requires a reasonable number of rules
→ abandoned due to high complexity
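To make the rewrite-rule idea concrete, here is a minimal Python sketch. Only the "frorn" to "from" example comes from the slides; the other confusion pairs, the tiny lexicon and the function name `normalise` are assumptions for illustration, and this is not the module that was actually attempted. A rule is applied only when the token is unknown and the rewritten form is a known word, which is one simple way to avoid corrupting valid words such as "corn".

```python
# Character confusions typical of OCR output. Only "rn" -> "m" is implied
# by the slides ("frorn" -> "from"); the other pairs are illustrative.
CONFUSIONS = [("rn", "m"), ("vv", "w"), ("1", "l")]

# Tiny stand-in lexicon (assumption); a real one would be language-specific.
LEXICON = {"from", "the", "new", "world", "corn"}


def normalise(token: str, lexicon=LEXICON) -> str:
    """Rewrite a token only if it is unknown and a rule yields a known word."""
    if token.lower() in lexicon:
        return token                       # already a known word
    for wrong, right in CONFUSIONS:
        candidate = token.replace(wrong, right)
        if candidate != token and candidate.lower() in lexicon:
            return candidate               # first rule producing a known word
    return token                           # leave unknown tokens untouched


for word in ["frorn", "corn", "vvorld", "Reynolds"]:
    print(word, "->", normalise(word))
```

Even this toy version hints at why the approach was costly: every additional language and period needs its own rules and lexicon, and names, the very tokens NER cares about, are rarely in a lexicon.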

Evaluation NL
Derived via 4-fold cross-evaluation (25 out of 100 annotated pages)

Evaluation FR
Derived via 4-fold cross-evaluation (25 out of 100 annotated pages)
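The 4-fold scheme used for these evaluations can be sketched in Python as follows: 100 pages are split into four folds of 25, and each fold is held out once while the classifier is trained on the remaining 75. How pages were actually assigned to folds is not stated in the slides, so the interleaved split, the page identifiers and the function names here are assumptions.

```python
from typing import Iterator, List, Set, Tuple


def k_fold_splits(pages: List[str],
                  k: int = 4) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train, test) page lists, holding out each fold once."""
    folds = [pages[i::k] for i in range(k)]          # interleaved assignment
    for held_out in range(k):
        test = folds[held_out]
        train = [p for j, fold in enumerate(folds) if j != held_out for p in fold]
        yield train, test


def precision_recall_f1(gold: Set[tuple],
                        predicted: Set[tuple]) -> Tuple[float, float, float]:
    """Entity-level scores from sets of (page, position, type) tuples."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


pages = [f"page_{n:03d}" for n in range(100)]        # hypothetical page IDs
for fold_no, (train, test) in enumerate(k_fold_splits(pages), start=1):
    print(f"fold {fold_no}: train on {len(train)} pages, test on {len(test)}")
```

Scores would then be computed per fold with `precision_recall_f1` and averaged; whether the reported figures are averages over folds is not stated in the slides.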

Use cases
• Improving search, information retrieval
– Within digital newspapers, a vast majority of
user queries are person and place names
• Linking of named entities to authority files
to create linked data
– The classification and disambiguation of named
entities allows the assignment of unique
identifiers from authoritative sources – thus
enabling cross-language/cross-collection linking

Next steps
• Volunteers wanted!
Help correct corpus and collaboratively create a
free dataset – instructions on GitHub wiki:
– github.com/EuropeanaNewspapers/ner-corpora/wiki/Corpus-cleanup
• Plans to improve performance:
– Add distributional similarity as feature (Clark 2003)
– Semantic generalisation (Faruqui & Padó 2010)
– Specialised gazetteers (e.g. list of historic place names)
– Data, data, data

Open resources
• Europeana Newspapers NER dataset (CC0):
– github.com/EuropeanaNewspapers/ner-corpora
• Europeana Newspapers NER software (EUPL):
– github.com/EuropeanaNewspapers/europeananp-ner
– github.com/EuropeanaNewspapers/europeananp-dbpedia-disambiguation
• Annotated ALTO files:
– lab.kbresearch.nl/static/html/eunews.html


Thank you for your attention!
Questions?
