2. This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Agenda
•Introduction
•General considerations on metadata
•Metadata and newspaper digitisation
•EU Newspaper Project - Profile
Introduction
•Innsbruck University
•Digitisation and Digital Preservation Group
• Since 1995 involved in Digital Library Projects
• Coordinated several EU R&D projects, currently 8 FTEs
• Introduced ALTO (Analyzed Layout and Text Object) to the
libraries community in 2002
• Fostered Optical Character Recognition for blackletter fonts
(Gothic – Fraktur) in 2004
• Initiated and coordinated E-Books on Demand Network (EOD)
• Member of the Executive Board of the IMPACT Project 2008-2012
(Large scale project for mass-digitisation and text recognition)
• Development of a rule-based document understanding platform
Introduction
•Digitisation and full-text recognition
• Several projects since 1995, all with OCR processing
• Newspaper clippings (650.000 clippings)
• Index cards from libraries (31 catalogues, several millions of
cards)
• German dissertations (215.000 dissertations, 24 mill. pages)
•Currently
• Three EU projects, including partnership in the FP7 tranScriptorium
project (=Handwritten Text Recognition)
• OCR processing of 8 mill. newspaper pages for EU Newspaper
• Digitisation of the regional newspaper from Tyrol/Austria
General considerations
• Definition
• “Data about data”
• Example 1
• Data: I am taking part in this event which is a workshop of the EU Newspaper
project in Ankara. The event is currently going on.
• Metadata: On May 3rd, Günter Mühlberger took part in the EU Newspaper
Workshop in Ankara.
• Example 2
• Data: We are digitising a newspaper. We cut the binding and use a document
scanner and we produce digital image files.
• Metadata: A Kodak i620 scanner with 24-bit colour information and an automatic
document feeder provides JPEG files with low compression (90% of the information
in the original file is kept)
General considerations
• Example 3
• Data: A service provider delivers the scanned images from a newspaper to
the library via a hard disc.
• Metadata: A Windows File System, where a root directory must be found with
the identity number of the newspaper, subdirectories with years and further
subdirectories on issue/day level. An XML file is expected on root level with
metadata on the files.
• Observations
• With metadata we are introducing a “new view” on data
• Metadata are more like a summary or a table of contents of the data than “new” data
• Often they are implicit and people will say “This is clear to us anyway”.
• Metadata need some conventions and agreements
• We can produce metadata on metadata
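The delivery convention in Example 3 can be made explicit and checked by a machine. The following is a minimal sketch only: the directory naming patterns (four-digit years, YYYY-MM-DD issue folders) and the function name check_delivery are assumptions for illustration, not a convention stated by the project.

```python
import re
from pathlib import Path

YEAR = re.compile(r"^\d{4}$")
ISSUE_DAY = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # assumed naming, e.g. 1913-05-03


def check_delivery(root: Path) -> list[str]:
    """Return a list of problems found in a delivered newspaper directory."""
    problems = []
    # Expectation from the slide: an XML file with metadata on root level.
    if not any(p.suffix.lower() == ".xml" for p in root.iterdir() if p.is_file()):
        problems.append("no XML metadata file at root level")
    for year_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if not YEAR.match(year_dir.name):
            problems.append(f"unexpected year directory: {year_dir.name}")
            continue
        for issue_dir in sorted(p for p in year_dir.iterdir() if p.is_dir()):
            if not ISSUE_DAY.match(issue_dir.name):
                problems.append(
                    f"unexpected issue directory: {year_dir.name}/{issue_dir.name}"
                )
            elif not issue_dir.name.startswith(year_dir.name):
                problems.append(
                    f"issue {issue_dir.name} filed under wrong year {year_dir.name}"
                )
    return problems
```

Calling check_delivery on the root directory of a delivered hard disc returns an empty list when the implicit agreement has been followed, which is one way of turning "this is clear to us anyway" into an explicit rule.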
Analogue vs. digital library world
•Can books survive without libraries (or more generally
without an organisation that takes care of them)?
•Can books survive without index cards?
• YES!
•Why?
• Most often they contain their own library card = title page
• They can be read and understood by human beings
• Their physical condition is rather stable as long as they are stored
in a dry environment and no disaster takes place (fire, water,...)
Analogue vs. digital
• Can digital works survive without libraries (or organisations taking
care of them)?
• Can digital works survive without index cards?
• NO!
• Why?
• Digital data need a technical system to keep them alive. If the technical
system goes away, the digital data are gone too. For example, a world without
electricity for 20 years would lead to heavy data loss.
• Digital data cannot be read by human beings – we need a device to make
them visible for us.
Some other differences
• Is a machine able to read a book?
• NO!
• Is a machine able to read a digital book?
• YES and NO (only first attempts)
• YES, it is able to automatically process the document, to extract
the content, to index it, to print it out, to publish it on the Internet,
etc.
• NO, not in the sense of a human being who understands the
content of a book, but already in the sense that the machine will
recognise a lot of the content (e.g. person names, institutions,
geographical names)
Observations
•Metadata are data about data.
•This game can be played several times.
•Metadata are structuring unstructured information.
•Metadata are helpful if they are kept close to the data.
•Metadata are especially important for digital data since
digital data are invisible to human beings.
•Digital metadata can be understood by machines, analogue
metadata cannot.
Consequences
•Good metadata must
• record data that are helpful for the two main tasks of libraries:
preservation and access
• structure data in a meaningful way
• be readable to human beings
• be readable to machines
• be acknowledged and maintained by the community
• be available in explicit form with explanations, examples and
guidelines
Metadata in a (newspaper) digitisation project
•Analogue material
•Digitisation process
•Text recognition process
•Structural enhancement
•File naming and structuring
•Ownership and Intellectual Property rights
•Digital provenance
•Intellectual substance
Analogue newspaper
•“Typical metadata”
• Start and (maybe) end date of a newspaper
• Place of publication
• Titles and variations
• Publishers and editors
• Frequency of publication
• Language
• Material aspects, such as size of the paper
•Does a digitisation project need to recapture all this
information?
•Library catalogue is the authoritative source for this
information – a link (Identifier) the obvious solution
Analogue newspaper
•Most of these data are within library catalogues, but a lot is
also missing:
• E.g.: very rarely a complete directory of all newspaper issues is
available, or for all special editions (e.g. for historical events, etc.),
supplements, etc.
• Missing issues or missing pages are also very rarely recorded
• Newspaper digitisation project managers regularly find that their
work produces the first complete record of the edition (including
its gaps)
•The natural structural unit is therefore the issue, and this is one of
the most important pieces of metadata within EU Newspaper as well
• We expect all files structured according to issues/days (not to
volumes or years)
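Once files are structured by issue/day, gaps in a run can be listed automatically. The sketch below assumes that the publication weekdays of a title are known; the function name, the sample dates and the Monday-to-Saturday schedule are hypothetical.

```python
from datetime import date, timedelta


def missing_issue_days(issue_days: set[date], first: date, last: date,
                       publication_weekdays: set[int]) -> list[date]:
    """List expected publication days in [first, last] with no digitised issue."""
    gaps = []
    day = first
    while day <= last:
        # weekday(): Monday is 0, Sunday is 6
        if day.weekday() in publication_weekdays and day not in issue_days:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


# Hypothetical run: a paper appearing Monday to Saturday.
digitised = {date(1913, 5, 1), date(1913, 5, 2), date(1913, 5, 5)}
gaps = missing_issue_days(digitised, date(1913, 5, 1), date(1913, 5, 6),
                          {0, 1, 2, 3, 4, 5})
print(gaps)
```

A gap found this way is only a candidate: the issue may never have been published, may be lost, or may simply not have been scanned, and that distinction is itself metadata worth recording.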
Digitisation process
•A lot of artefacts: the image captured by the scanning system is
quite different from the image finally stored, because of internal
processes, enhancement software, deskewing, cropping, etc.
•Which metadata need to be kept?
•Type of scanner used?
• For a whole run, for single pages?
• Cameras now often replace scanners: they typically provide a lot
of metadata (EXIF), but the resolution is a problem (distance must
be known!)
•Microfilm
• Cameras!
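Why the resolution is a problem with cameras can be shown with simple arithmetic: the pixel count alone says nothing about pixels per inch on the original. The sketch below assumes that the physical width of the captured area is known (for example from a ruler in the frame); this is an assumption for illustration, the slide itself only states that the distance must be known.

```python
MM_PER_INCH = 25.4


def effective_ppi(pixels_across: int, physical_width_mm: float) -> float:
    """Pixels per inch on the original, given the width of the captured area."""
    if physical_width_mm <= 0:
        raise ValueError("physical width must be positive")
    return pixels_across / (physical_width_mm / MM_PER_INCH)


# Hypothetical capture: 6000 pixels across a 400 mm wide newspaper page.
print(effective_ppi(6000, 400.0))
```

The same camera at a greater distance captures a wider area with the same number of pixels, so the effective resolution drops, which is why this value has to be recorded per capture setup.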
Optical Character Recognition (OCR)
•Has a long history, but many libraries were (and often are)
sceptical
•OCR data are produced automatically and will contain errors
•The error rate depends on the type of printing, the age of the
newspaper, the way it has been scanned, etc. but also on
the software used, the version and the parameters of the
software
•OCR engines provide not only text, but also information on
the layout of a page: e.g. coordinates of words
•OCR data may be corrected in part, e.g. the title of an article
but not the body of the full text
•Crowdsourcing may also play a role
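One common way to quantify OCR errors is the character error rate: the edit distance between the OCR output and a manually checked transcription, divided by the length of that transcription. The slides do not name a specific measure, so the following is an illustrative sketch, not the project's evaluation method.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Edit distance normalised by the length of the checked transcription."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)
```

Recording such a figure together with the software, version and parameters used makes later comparisons between OCR runs possible.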
Structural enhancement
• The physical units of a newspaper are the issue and the single
page.
• But for the reader the natural unit is the article, i.e. a piece of
content.
• Articles may consist of titles, subtitles, leads, photos connected
with them, caption lines, etc.
• But apart from articles we will find also announcements,
advertisements, charts, weather reports, tables with stock notes,
etc.
• Structural enhancement may be done completely automatically
(like OCR), but in most cases it is a manual (outsourced) process
• Time consuming and expensive, and therefore especially important to
know what has been done and with which accuracy
Intellectual property rights
•Several groups/persons have IP rights on newspapers
•The publisher/editor
• Responsible for the whole newspaper – usually distribution rights
stay with the publisher
•Journalists as permanent staff of the newspaper
• Copyright will stay with the journalists but access rights usually
belong to the newspaper publisher
•Freelancers
• Especially photographers; IP rights stay with them
IP Rights
• Problem of “orphan works”
• Digitising a newspaper and making it available is a new kind of usage
and is not covered by “old” contracts
• This IP right therefore stays in principle with the copyright owners
• But it is impossible to identify all IP right holders of an old newspaper
• EU Directive on Orphan Works
• Puts libraries in a privileged position: they will be allowed to digitise orphan
works under special conditions
• Required to make a diligent search
• To document this search and to register the item in question
• To remunerate right holders who come forward
• New kinds of metadata are needed as well
Digital provenance
•What happens to the digital data as time goes by?
•E.g. migration from one image format to another?
•Or update of OCR data?
•Or further structural enhancement with the support of users
(=crowd)?
•Data need to stay coherent and changes should be
transparent
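Keeping changes transparent amounts to never overwriting history: every migration, OCR update or crowd correction is appended as an event. The sketch below is a deliberately simple, hypothetical data model to illustrate the idea; the class and field names are invented and do not follow any particular metadata schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEvent:
    event_type: str   # e.g. "migration", "ocr-update", "crowd-correction"
    agent: str        # software or person responsible
    detail: str
    timestamp: str


@dataclass
class DigitalObject:
    identifier: str
    events: list[ProvenanceEvent] = field(default_factory=list)

    def record(self, event_type: str, agent: str, detail: str) -> ProvenanceEvent:
        event = ProvenanceEvent(
            event_type, agent, detail,
            datetime.now(timezone.utc).isoformat(),
        )
        self.events.append(event)  # append-only: earlier events are never rewritten
        return event


# Hypothetical life cycle of one newspaper page.
page = DigitalObject("newspaper-0001/1913-05-03/page-0001")
page.record("migration", "image converter", "image format changed")
page.record("ocr-update", "OCR engine, newer version", "full text regenerated")
```

Because events are frozen and only ever appended, the list in page.events can be read as the history of the object.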
Intellectual substance
•“The Times”, “Le Monde”, “Frankfurter Allgemeine Zeitung”,
“Washington Post”, etc.
•Publishers, well-known journalists, famous articles and
headlines, history of journalism, etc.
•Political attitude
•Layout, structuring and general appearance
•Objective must be: A holistic understanding of a newspaper
taking into account all aspects which we have mentioned
before.
•Datamining technology
Main challenge: Putting it all together
[Diagram: metadata areas to be combined: Intellectual substance, Digital provenance, Analogue newspaper, Enhancement, IP rights, Digitisation process, File naming, OCR and enhancement, ???]
EU Newspaper ENMAP
•Europeana Newspaper METS ALTO Profile
•Objective
• Provide a robust metadata model for the digitisation of
newspapers that can be used by libraries for preservation, access
and interoperability (delivery of data to Europeana)
•Roadmap
• Set up an internal format by M12
• Implement it within the project and deliver information packages to
Europeana according to this format
• Extend the internal format and make it more general so that it can
easily be used outside the project as well (by M18/July 2013)
ENMAP – Main approach
•METS (Metadata Encoding and Transmission Standard)
• Library of Congress
• Open format: Editorial board for maintaining the format
• Container format: Provides a frame for all kinds of metadata
• Goes back to Making of America II (a late-1990s digitisation project)
• XML format (readable for machines as well as human beings –
with a simple text editor)
• THE dominant format in the library world for digitisation projects
• Within OAIS (Open Archival Information System) it serves as an
Information Package (AIP, SIP, DIP)
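What "container format" means can be seen from the top-level sections of a METS document. The sketch below builds an empty skeleton with Python's standard library; the object identifier is hypothetical, and the result is only an outline (a real METS file needs IDs and content in these sections to validate against the schema).

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def mets_skeleton(object_id: str) -> ET.Element:
    """Build the empty top-level frame of a METS document."""
    root = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": object_id})
    # Header, descriptive metadata, administrative metadata,
    # file inventory and structural map, in schema order.
    for section in ("metsHdr", "dmdSec", "amdSec", "fileSec", "structMap"):
        ET.SubElement(root, f"{{{METS_NS}}}{section}")
    return root


skeleton = mets_skeleton("newspaper-0001_1913-05-03")  # hypothetical identifier
print(ET.tostring(skeleton, encoding="unicode"))
```

Each section is a slot into which other schemas are plugged, which is also why profiles are needed to say which schemas are used where.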
METS
•Cons of METS
• The standardisation level is rather low: Profiles are needed to
specify the actual usage of the format
• It introduces an extra complexity to digitisation projects
• The role of METS within the digital preservation process is not
always clear: Is it just for delivery of data (e.g. Submission
Information Package) or for “real” preservation, e.g. as Archival
Information Package (AIP)?
•My personal opinion
• A rich METS file as AIP together with all content data (=images,
OCR files, etc.) on a storage server are good prerequisites for
digital preservation
ENMAP
•Usage of METS within EU Newspaper
• We use METS as a submission format (SIP) for delivering data from
the libraries and the enhancement processes (OCR, structural
enhancement) to Europeana
• But, as a concept, we will also provide a format that may serve as an
Archival Information Package
•Descriptive metadata
• Are kept in MODS (Metadata Object Description Schema)
• Dublin Core would have been an alternative, but MODS is richer
and there is a relationship to MARC21
•Technical data
• MIX (Metadata for Images in XML Standard)
• Extended format
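A descriptive record in MODS for a single issue could look roughly like the output of the sketch below. The choice of fields, the sample values and the function name are assumptions for illustration and do not reproduce the ENMAP profile; the identifier stands for the link back to the library catalogue.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def mods_record(title: str, place: str, date_issued: str,
                language_code: str, catalogue_id: str) -> ET.Element:
    """Build a small MODS record for one newspaper issue."""
    def sub(parent, name, text=None, **attrs):
        element = ET.SubElement(parent, f"{{{MODS_NS}}}{name}", attrs)
        element.text = text
        return element

    mods = ET.Element(f"{{{MODS_NS}}}mods")
    sub(sub(mods, "titleInfo"), "title", title)
    origin = sub(mods, "originInfo")
    sub(sub(origin, "place"), "placeTerm", place, type="text")
    sub(origin, "dateIssued", date_issued, encoding="iso8601")
    sub(sub(mods, "language"), "languageTerm", language_code,
        type="code", authority="iso639-2b")
    # Link to the authoritative catalogue record instead of re-keying it.
    sub(mods, "identifier", catalogue_id, type="local")
    return mods


# All values below are invented sample data.
record = mods_record("Example Zeitung", "Innsbruck", "1913-05-03", "ger", "cat-0001")
print(ET.tostring(record, encoding="unicode"))
```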
ENMAP
• OCR Data
• ALTO (Analyzed Layout and Text Object)
• Are kept “outside” METS in XML files (one file per image) but are linked from it
• ALTO makes it possible to store not only text, but also information from the OCR
engine, such as coordinates of blocks, lines, words as well as type of
blocks, e.g. text or pictures
• Connection between image and text is important for e.g. producing PDFs,
or highlighting search results on an image or for further enhancement
• One of the main achievements of the IMPACT project was to convince
industry to provide native ALTO export, e.g. ABBYY FineReader
• ALTO is therefore rather simple to produce, and the effect on standardisation
should be rather high
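Highlighting a search hit on the page image needs nothing more than the word coordinates stored in ALTO. The sketch below parses a tiny invented ALTO fragment; the namespace shown is the ALTO version 2 namespace, which is an assumption here since the slides do not say which ALTO version is used.

```python
import xml.etree.ElementTree as ET

ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Invented sample page with a single text line of two words.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2400" HEIGHT="3400">
      <PrintSpace>
        <TextBlock ID="TB1" HPOS="100" VPOS="200" WIDTH="900" HEIGHT="60">
          <TextLine HPOS="100" VPOS="200" WIDTH="900" HEIGHT="60">
            <String CONTENT="Daily" HPOS="100" VPOS="200" WIDTH="420" HEIGHT="60"/>
            <SP WIDTH="20"/>
            <String CONTENT="News" HPOS="540" VPOS="200" WIDTH="460" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def find_word_boxes(alto_xml: str, query: str) -> list[tuple[int, int, int, int]]:
    """Return (x, y, width, height) for every word matching the query."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for word in root.iterfind(".//alto:String", ALTO_NS):
        if word.get("CONTENT", "").lower() == query.lower():
            boxes.append(tuple(
                int(float(word.get(key)))
                for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
            ))
    return boxes


print(find_word_boxes(SAMPLE_ALTO, "news"))
```

The returned rectangles are in the coordinate system of the ALTO file, so a viewer still has to scale them to the size of the displayed image.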
ENMAP
•Structural enhancement
• METS provides a “structural map” that makes it possible to manage single
articles which may come from an enhancement process
• E.g. titles of articles, reading order of sections and pictures can be
recorded within METS and then linked to the image via the ALTO
files
• In this way it is possible to index or to display a single article
•Our ambition
• To contribute to the standardisation of structural data by providing
a data dictionary for structural enhancement.
• We believe that a clear structuring will support especially full-text
searching and further text mining
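The link from an article in the structural map to blocks in an ALTO file can be sketched with the METS div, fptr and area elements. The TYPE values ("LOGICAL", "ISSUE", "ARTICLE"), the file ID and the block IDs below are assumptions for illustration and are not taken from the ENMAP data dictionary.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def m(name: str) -> str:
    """Qualified METS element name."""
    return f"{{{METS_NS}}}{name}"


def article_div(parent: ET.Element, label: str, order: int,
                alto_file_id: str, block_ids: list[str]) -> ET.Element:
    """Add one article whose text lives in the given ALTO blocks."""
    div = ET.SubElement(parent, m("div"),
                        {"TYPE": "ARTICLE", "LABEL": label, "ORDER": str(order)})
    for block_id in block_ids:  # list order is the reading order
        fptr = ET.SubElement(div, m("fptr"))
        # BEGIN holds the ID of an element inside the referenced ALTO file.
        ET.SubElement(fptr, m("area"),
                      {"FILEID": alto_file_id, "BETYPE": "IDREF", "BEGIN": block_id})
    return div


struct_map = ET.Element(m("structMap"), {"TYPE": "LOGICAL"})
issue = ET.SubElement(struct_map, m("div"), {"TYPE": "ISSUE"})
article_div(issue, "Hypothetical headline", 1, "ALTO0001", ["TB1", "TB2"])
print(ET.tostring(struct_map, encoding="unicode"))
```

With this link in place, indexing or displaying a single article means following the area references into the ALTO file and collecting the text and coordinates of the referenced blocks.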
ENMAP
•IP Rights
• Currently there are no standards available
• Within ENMAP we state only the ownership of the library on the
digital files – but this must not be confused with the actual IP rights
•Digital provenance
• PREMIS (PREservation Metadata: Implementation Strategies)
• An attempt to provide a general framework for “events” within the
life cycle of digital objects
•Intellectual content
• Tagging of Named Entities is a first step towards data mining
• No standards are currently available
Keep the format alive!
•Is there a chance that libraries outside/after the EU
Newspaper project will use and take up ENMAP?
•We believe “yes”!
Uptake of the format
•Reasons
• 10 mill. newspaper pages will be enhanced within EU
Newspaper; this means several hundred thousand ENMAP
packages
• 12 libraries from all over Europe will receive data in ENMAP
• Europeana will naturally use this format for further integration of
newspaper information
• Software tools are available to support generating
metadata as well as validating and delivering the data
• A workflow is available for putting everything together and
producing ENMAP packages
• Documentation and examples will be available
Contribute
•Further steps
• You are invited to review ENMAP once it is out (during 2013)
• You will find a public version on the website of EU Newspaper:
•http://www.europeana-newspapers.eu/
•Thank you for your attention!