1. @maxkaiser
Text Mining in Cultural
Heritage: Challenges
Max Kaiser
Head of Research & Development
Austrian National Library
max.kaiser@onb.ac.at
@maxkaiser
Text & Data Mining in Europe
The Hague, Nov. 11, 2015
https://www.flickr.com/photos/shanegorski/2694765397/
No, this is
not the
Austrian National Library …
7. @maxkaiser
digitisation
of the entire historical
book holdings of the
Austrian National Library
Austrian Books
Online
www.onb.ac.at/austrianbooksonline/
12. @maxkaiser
Austrian Newspapers Online
→Started 2003
→More than 14 million pages digitised
→More than 10 million pages processed with OCR
across several projects
→Structured by newspaper & publication
date
15. @maxkaiser
A short digression …
Full text data is already useful today
(without text mining)
Full text data will be even more useful in the
future (with text mining)
29. @maxkaiser
Hadoop-Cluster
→Comprehensive analysis with tools on the
entire ONB corpus
→Enquiries over the whole range of
materials in Austrian Books Online
→Scalable processing, e.g. for text mining
→Processing of analyses on site
→User-driven analyses
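
To make "scalable processing" concrete, here is a minimal sketch of a word count written in the map/reduce style that a Hadoop cluster distributes across nodes. It is an illustration only, not the ONB pipeline: the function names, the tokeniser and the sample lines are assumptions, and the shuffle/sort step that Hadoop performs between mapper and reducer is emulated locally with `sorted`.

```python
import re
from itertools import groupby

# Simplistic tokeniser (assumption): runs of word characters, lower-cased.
TOKEN = re.compile(r"\w+")


def mapper(lines):
    """Emit a (word, 1) pair for every token in every input line."""
    for line in lines:
        for token in TOKEN.findall(line.lower()):
            yield token, 1


def reducer(pairs):
    """Sum the counts per word. Expects pairs sorted by word."""
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Emulate map -> sort -> reduce on a single machine for testing."""
    return dict(reducer(sorted(mapper(lines))))


# Hypothetical sample input standing in for OCR text lines.
counts = run_local(["Wien und Graz", "Wien, Wien"])
print(counts)
```

Because mapper and reducer only see one line or one key at a time, the same logic can in principle be run over billions of lines by splitting the input across machines.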
31. @maxkaiser
325,000+ books
38 TB data:
105 million pages
33 billion running words
5 billion lines of text
186 identified languages
0.5 TB text index size in 3 Solr shards
Austrian Books
Online
Status
Oct. 2015
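
The figures on this slide imply some rough averages, worked out below. This is back-of-the-envelope arithmetic on the rounded numbers above, so the results are approximate; the even split of the index across shards and the decimal conversion (1 TB = 1000 GB) are assumptions.

```python
# Rounded figures taken from the slide (status Oct. 2015).
books = 325_000
pages = 105_000_000
words = 33_000_000_000
lines = 5_000_000_000
index_tb = 0.5
shards = 3

pages_per_book = pages / books          # about 323
words_per_page = words / pages          # about 314
words_per_line = words / lines          # 6.6
# Assumes the index is spread evenly over the shards.
index_gb_per_shard = index_tb * 1000 / shards  # about 167

print(f"{pages_per_book:.0f} pages per book")
print(f"{words_per_page:.0f} words per page")
print(f"{words_per_line:.1f} words per line")
print(f"{index_gb_per_shard:.0f} GB of index per shard")
```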
40. @maxkaiser
→ Difficult to integrate in stable (legacy) systems
→ Base technologies still evolving
→ Spread of languages / diachronic evolution of
languages makes analysis pipelines complex
→ (Noisy) OCR results as source for analyses
→ Data not stable but constantly being updated:
→ Images improve
→ OCR improves
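
One possible way to cope with data that keeps changing is to store a fingerprint of the OCR text alongside every analysis result, so that a result can always be traced back to the text version it was computed from. The sketch below is an assumption for illustration, not a description of the ONB setup; `text_fingerprint`, `analyse` and the sample strings are hypothetical.

```python
import hashlib


def text_fingerprint(text: str) -> str:
    """Return a short, stable identifier for one version of an OCR text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def analyse(page_id: str, ocr_text: str) -> dict:
    """Toy analysis (token count) tagged with the OCR version it used."""
    return {
        "page_id": page_id,
        "ocr_version": text_fingerprint(ocr_text),
        "token_count": len(ocr_text.split()),
    }


# Hypothetical example: the same page before and after an OCR re-run.
old = analyse("p1", "Dieſe Zeitnng erſcheint täglich")
new = analyse("p1", "Diese Zeitung erscheint täglich")

# Differing fingerprints signal that stored results may be outdated.
print(old["ocr_version"] != new["ocr_version"])
```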
43. @maxkaiser
→ Chunks of image
and text-blocks are
recognized
→ Content is split into
contiguous blocks of text
→ The yellow boxes are
recognized as graphic
content
Text data
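
The block structure described above can be sketched as a small data model: each recognised block is either text or graphic, and the text data of a page is obtained by dropping the graphic blocks and joining the rest. The `Block` class, its field names and the top-to-bottom reading order are assumptions made for this illustration; real layout analysis output, especially for multi-column newspapers, needs a more careful reading order.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str       # "text" or "graphic" (hypothetical labels)
    x: int          # left edge in pixels
    y: int          # top edge in pixels
    text: str = ""  # recognised text; empty for graphic blocks


def page_text(blocks):
    """Join the text blocks of a page, top to bottom, then left to right."""
    text_blocks = [b for b in blocks if b.kind == "text"]
    text_blocks.sort(key=lambda b: (b.y, b.x))
    return "\n".join(b.text for b in text_blocks)


# Hypothetical page: two text blocks and one illustration.
page = [
    Block("text", 0, 200, "second paragraph"),
    Block("graphic", 0, 100),
    Block("text", 0, 0, "first paragraph"),
]
print(page_text(page))
```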
51. @maxkaiser
→IP or otherwise restricted content
has to be processed on site
→(e.g. local Hadoop installation)
→Interfaces (APIs etc.) not yet available,
but will be in the near future
53. @maxkaiser
→ Text mining of OCRed historical texts is still
a research topic
→ R&D in this area is understood; however, it
still requires a commitment to using results
in production
→ Requires interfacing between:
→R&D
→IT services
→Library systems
→ And: what do the users want?
55. @maxkaiser
→ “Integration with current infrastructure is
complex”
→ “Legal implications are risky”
→ “There is still too much R&D involved to be
integrated in production environment”
→ “There are no clear use cases”
→ “A shift from (human produced) meta-data to
machine-understanding primary data is too
fundamental”
→ “Text mining results are opaque (and statistics
are hard to grasp)”
57. @maxkaiser
→Topic modelling and NER should cross
institutional boundaries
→Text mining techniques cover a wide
spectrum of possible applications
→User-driven approach applied to material
from multiple institutions
Q: Who are the main drivers (Europeana?,
institutional consortia?, the users?,
researchers?)
59. @maxkaiser
→ Lack of integration in library discovery systems
→ Limitations in available vendor systems
→ Difficulty of explaining possible instabilities in results
→ results are expected to improve as corpora evolve
→ methodology, corpus and implementation are moving
targets
→ New and different resources required
→ IT infrastructures, technical and research staff, text
mining competency
→ Progress is fast!
→ The expectation of even more potential in the future
delays investment in this future, now.
61. @maxkaiser
→Content discovery
→Recommender systems (for librarians, for
patrons)
→Focused user-driven aggregation of
content
→Cross-referencing of named entities (NER) in materials
→Crowd-sourcing and user-driven
development
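
As an illustration of cross-referencing named entities across materials, the sketch below builds an inverted index from entity to documents and keeps the entities that occur in more than one item. It assumes the entities have already been extracted by some NER tool; the document identifiers, entity lists and function names are hypothetical, and matching by case-folded string is a deliberate simplification (real data would need spelling normalisation and disambiguation).

```python
from collections import defaultdict


def build_entity_index(doc_entities):
    """Map each entity (case-folded) to the set of documents mentioning it."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            index[entity.casefold()].add(doc_id)
    return index


def cross_references(index):
    """Keep only entities that link at least two documents."""
    return {
        entity: sorted(docs)
        for entity, docs in index.items()
        if len(docs) > 1
    }


# Hypothetical NER output for three items from different collections.
docs = {
    "book:1": ["Wien", "Mozart"],
    "newspaper:7": ["Mozart", "Salzburg"],
    "book:9": ["Graz"],
}
links = cross_references(build_entity_index(docs))
print(links)
```

Links of this kind could feed the content discovery and recommender ideas listed above, since every shared entity is a candidate connection between two items.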