An Open Corpus for Named Entity Recognition in Historic Newspapers
1. An Open Corpus for Named Entity
Recognition in Historic Newspapers
Clemens Neudecker
Berlin State Library
@cneudecker
LREC2016, 23-28 May 2016, Portorož, Slovenia
2. Background
• Europeana Newspapers EU-project:
www.europeana-newspapers.eu
• OCRed 12m pages of historic newspapers
from Europe (an estimated 25 billion words!)
• Newspaper content from 23 libraries, in 40
languages, covering 4 centuries (1618-1990)
• Public domain full-text available for download
per language/content provider
3. Formats & Standards
• Full-text produced in ALTO
• Metadata (structural) in METS
• Metadata (bibliographic) in EDM
• Not a fan of XML?
Good ol' plain text (UTF-8) is also available…
research.europeana.eu/itemtype/newspapers
• Currently working on:
– API for text/search
– API for images (IIIF)
4. Approach
• 3 languages selected for NER:
Dutch, German, French – in collab. with
• Content in these languages constitutes about
50% of the overall full-text in the collection
5. Methodology
• Select 100 representative pages per language
– If a classifier already exists for given language –
run it on the selected 100 pages
– Ingest tagged/untagged pages to annotation tool
– Manually add/correct annotations
(>=2 librarians per language)
– Export and convert tagged data to BIO format
– Train classifier from BIO & gazetteers (if available)
– Evaluate derived classifier using 4-fold cross-validation
– Repeat until classification performance converges
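The export and evaluation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual tooling: the span format `(start, end, label)`, the helper names `to_bio` and `k_fold`, and the page identifiers are all assumptions made for the example.

```python
def to_bio(tokens, spans):
    """Return one BIO tag per token.

    spans is a list of (start, end, label) with an exclusive end index;
    this input format is an assumption for illustration.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


def k_fold(items, k=4):
    """Yield (train, test) pairs; fold i tests on every k-th item from i."""
    for i in range(k):
        test = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


tokens = ["Mr", "Reynolds", "left", "Baltimore", "yesterday"]
spans = [(1, 2, "PER"), (3, 4, "LOC")]
bio_tags = to_bio(tokens, spans)

# 100 hypothetical page identifiers, split into 4 folds of 25 test pages each
pages = [f"page_{n:03d}" for n in range(100)]
folds = list(k_fold(pages, k=4))
```

Splitting by page rather than by sentence keeps text from one page out of both the training and the test side of a fold, which seems the safer default for noisy OCR data.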
6. NER software
• Tested Stanford NER, OpenNLP, NLTK, Gate
• Adaptation of Stanford NER package (CRF)
– Mature, well-documented, widely used
– Open source (GPL)
– Thread-safe & platform-independent (JVM)
– Machine learning scales out more easily
to multiple languages
– Prior experience working with CRF
7. NER encoding in ALTO
• In ALTO version 2.1 and later, this is possible:
<String STYLEREFS="ID7" HEIGHT="132.0" WIDTH="570.0" HPOS="5937.0"
VPOS="3279.0" CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5">
</String>
<String STYLEREFS="ID7" HEIGHT="102.0" WIDTH="540.0" HPOS="18438.0"
VPOS="22008.0" CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10">
</String>
…
<Tags>
<NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
<NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
</Tags>
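Reading these annotations back out of an ALTO file might look like the sketch below. It uses only the element and attribute names shown on the slide (`String`, `TAGREFS`, `Tags`, `NamedEntityTag`, `TYPE`, `CONTENT`); the sample document, its wrapper elements, and the function names are assumptions for illustration. Real ALTO files declare an XML namespace, so the sketch compares local element names only.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical ALTO fragment modelled on the slide's example
SAMPLE = """<alto>
  <Tags>
    <NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
    <NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
  </Tags>
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Reynolds" TAGREFS="Tag5"/>
    <String CONTENT="visited"/>
    <String CONTENT="Baltimore" TAGREFS="Tag10"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local(tag):
    """Strip a '{namespace}' prefix from an element name, if present."""
    return tag.rsplit("}", 1)[-1]


def tagged_strings(xml_text):
    """Return (word, entity type) pairs for every String carrying a TAGREFS."""
    root = ET.fromstring(xml_text)
    # Map tag IDs to entity types, e.g. "Tag5" -> "Person"
    types = {
        el.get("ID"): el.get("TYPE")
        for el in root.iter()
        if local(el.tag) == "NamedEntityTag"
    }
    result = []
    for el in root.iter():
        if local(el.tag) != "String":
            continue
        # TAGREFS may hold several space-separated IDs
        for ref in el.get("TAGREFS", "").split():
            if ref in types:
                result.append((el.get("CONTENT"), types[ref]))
    return result


entities = tagged_strings(SAMPLE)
```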
8. Annotation
• Evaluated BRAT, WebAnno, INL Attestation
• Reasons for selection of INL Attestation:
– Speed
– Support of ALTO format
– Support from INL available
9. Annotation stats
Language   # tokens   # PER   # LOC   # ORG
French     207,000    5,672   5,614   2,574
Dutch      182,483    4,492   4,448   1,160
German      96,735    7,914   6,143   2,784

Language   Tokens   % PER   % LOC   % ORG
French     100%     2.75%   2.71%   1.24%
Dutch      100%     2.46%   2.44%   0.64%
German     100%     8.18%   6.35%   2.88%

Language   Word-Error-Rate (Bag of Words)   Reading Order Success Rate
French     16.6%                            19.9%
Dutch      17.6%                            23.2%
German     15.9% / 21.9%                    13.6%
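The percentage figures follow from the raw counts: each is the number of annotated entities of a type divided by the token count for that language. A small sketch of that arithmetic, using the counts from the slide (the dictionary layout and the function name `entity_share` are assumptions):

```python
# Counts copied from the annotation statistics on the slide
counts = {
    "French": {"tokens": 207000, "PER": 5672, "LOC": 5614, "ORG": 2574},
    "Dutch":  {"tokens": 182483, "PER": 4492, "LOC": 4448, "ORG": 1160},
    "German": {"tokens": 96735,  "PER": 7914, "LOC": 6143, "ORG": 2784},
}


def entity_share(row):
    """Percentage of tokens annotated with each entity type."""
    return {
        label: round(100 * row[label] / row["tokens"], 2)
        for label in ("PER", "LOC", "ORG")
    }


shares = {lang: entity_share(row) for lang, row in counts.items()}

for lang, row in shares.items():
    print(lang, row)
```

The French token count on the slide looks rounded, so a recomputed French value can differ from the slide in the last digit.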
10. Challenges
• Clear, comprehensive & common guidelines
for manual annotation
• OCR quality – on average 80% word accuracy
• Wide variation in historical spelling
• Mix of languages on a single page
• Lack/loss of metadata on page/word level
• Some data corruption occurred when ingesting
pre-tagged data into the annotation tool
11. Attempted workarounds
• Introduce OCR error patterns into training
data
– actually yields lower precision/recall
• Introduce a spelling variation module in the
NER classifier
– rewrite rules (e.g. "frorn" → "from")
– high integration effort
– requires a reasonable number of rules
– abandoned due to high complexity
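To make the rewrite-rule idea concrete, here is a minimal sketch of what such a normalisation step could look like. It is not the module that was attempted in the project: only the "frorn" → "from" rule comes from the slide, while the second rule, the rule format, and the function name `normalise` are hypothetical.

```python
import re

# Each rule is (compiled pattern, replacement).
REWRITE_RULES = [
    (re.compile(r"\bfrorn\b"), "from"),   # example from the slide: "rn" misread for "m"
    (re.compile(r"\btbe\b"), "the"),      # hypothetical extra rule, for illustration only
]


def normalise(text, rules=REWRITE_RULES):
    """Apply every rewrite rule in order and return the normalised text."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


normalised = normalise("News frorn Baltimore reached tbe city")
```

Word-level rules like these have to be listed one by one, which hints at why a useful rule set grows large quickly.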
14. Use cases
• Improving search, information retrieval
– Within digital newspapers, a vast majority of
user queries are person and place names
• Linking of named entities to authority files
to create linked data
– The classification and disambiguation of named
entities allows the assignment of unique
identifiers from authoritative sources – thus
enabling cross-language/cross-collection linking
15. Next steps
• Volunteers wanted!
Help correct corpus and collaboratively create a
free dataset – instructions on GitHub wiki:
– github.com/EuropeanaNewspapers/ner-corpora/wiki/Corpus-cleanup
• Plans to improve performance:
– Add distributional similarity as feature (Clark 2003)
– Semantic generalisation (Faruqui & Padó 2010)
– Specialised gazetteers (e.g. list of historic place names)
– Data, data, data
16. Open resources
• Europeana Newspapers NER dataset (CC0):
– github.com/EuropeanaNewspapers/ner-corpora
• Europeana Newspapers NER software (EUPL):
– github.com/EuropeanaNewspapers/europeananp-ner
– github.com/EuropeanaNewspapers/europeananp-dbpedia-disambiguation
• Annotated ALTO files:
– lab.kbresearch.nl/static/html/eunews.html
17. References
• C. Neudecker, W.J. Faber, L. Wilms, T. van Veen:
Large scale refinement of digital historical
newspapers with named entity recognition.
Proceedings of the IFLA Newspaper Section
Satellite Meeting, 2014, Geneva, Switzerland.
• Y. Mosallam, A. Abi-Haidar, J.G. Ganascia:
Unsupervised named entity recognition and
disambiguation: An application to old French
journals.
Advances in Data Mining. Applications and
Theoretical Aspects, Springer LNCS, 2014.
18. Thank you for your attention!
Questions?
Clemens Neudecker
Berlin State Library
@cneudecker