Europeana Newspapers - Metadata
Ankara, 3rd May 2013
Günter Mühlberger, Innsbruck University
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Agenda
•Introduction
•General considerations on metadata
•Metadata and newspaper digitisation
•EU Newspaper Project - Profile
Introduction
•Innsbruck University
•Digitisation and Digital Preservation Group
• Since 1995 involved in Digital Library Projects
• Coordinated several EU R&D projects, currently 8 FTEs
• Introduced ALTO (Analyzed Layout and Text Object) to the
libraries community in 2002
• Fostered Optical Character Recognition for blackletter fonts
(Gothic – Fraktur) in 2004
• Initiated and coordinated E-Books on Demand Network (EOD)
• Member of the Executive Board of the IMPACT Project 2008-2012
(Large scale project for mass-digitisation and text recognition)
• Development of a rule-based document understanding platform
Introduction
•Digitisation and full-text recognition
• Several projects since 1995, all with OCR processing
• Newspaper clippings (650,000 clippings)
• Index cards from libraries (31 catalogues, several million cards)
• German dissertations (215,000 dissertations, 24 million pages)
•Currently
• Three EU projects, including partnership in the FP7 tranScriptorium
project (handwritten text recognition)
• OCR processing of 8 million newspaper pages for EU Newspaper
• Digitisation of the regional newspaper from Tyrol/Austria
General considerations
• Definition
• “Data about data”
• Example 1
• Data: I am taking part in this event which is a workshop of the EU Newspaper
project in Ankara. The event is currently going on.
• Metadata: On 3 May, Günter Mühlberger took part in the EU Newspaper
Workshop in Ankara.
• Example 2
• Data: We are digitising a newspaper. We cut the binding and use a document
scanner and we produce digital image files.
• Metadata: A Kodak i620 scanner with 24-bit colour information and automatic
document feeder provides JPEG files with low compression (90% of the
information in the original file is kept)
General considerations
• Example 3
• Data: A service provider delivers the scanned images from a newspaper to
the library via a hard disc.
• Metadata: A Windows File System, where a root directory must be found with
the identity number of the newspaper, subdirectories with years and further
subdirectories on issue/day level. An XML file is expected on root level with
metadata on the files.
• Observations
• With metadata we are introducing a “new view” on data
• Metadata are more like a summary or a table of contents of the data than “new” data
• Often they are implicit and people will say “This is clear to us anyway”.
• Metadata need some conventions and agreements
• We can produce metadata on metadata
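The directory convention from Example 3 can be checked mechanically. The following is only a minimal sketch: the file name `metadata.xml`, the four-digit year names and the identifier `NP0001` are assumptions made for illustration, not part of any agreed delivery specification.

```python
import tempfile
from pathlib import Path


def check_delivery(root: Path, manifest_name: str = "metadata.xml") -> list[str]:
    """Return a list of problems found in a delivered newspaper directory."""
    problems = []
    # The XML file with metadata on the files is expected at root level.
    if not (root / manifest_name).is_file():
        problems.append(f"missing {manifest_name} at root level")
    year_dirs = sorted(d for d in root.iterdir() if d.is_dir())
    if not year_dirs:
        problems.append("no year subdirectories found")
    for year_dir in year_dirs:
        # Assumed convention: year directories are named with four digits.
        if not (year_dir.name.isdigit() and len(year_dir.name) == 4):
            problems.append(f"unexpected year directory name: {year_dir.name}")
            continue
        # Each year must contain at least one issue/day directory.
        if not any(d.is_dir() for d in year_dir.iterdir()):
            problems.append(f"year {year_dir.name} contains no issue directories")
    return problems


# Build a tiny example delivery in a temporary directory and check it twice.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "NP0001"  # hypothetical newspaper identity number
    (root / "1913" / "1913-05-03").mkdir(parents=True)
    (root / "1914").mkdir()  # a year without any issues
    before = check_delivery(root)
    (root / "metadata.xml").write_text("<metadata/>", encoding="utf-8")
    after = check_delivery(root)

print(before)
print(after)
```

Writing the convention down as a check makes the "implicit" metadata explicit: anyone receiving a hard disc can see at once which expectations were not met.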
Analogue vs. digital library world
•Can books survive without libraries (or more generally
without an organisation that takes care of them)?
•Can books survive without index cards?
• YES!
•Why?
• Most often they contain their own library card = title page
• They can be read and understood by human beings
• Their physical condition is rather stable as long as they are stored
in a dry environment and no disaster takes place (fire, water,...)
Analogue vs. digital
• Can digital works survive without libraries (or organisations taking
care of them)?
• Can digital works survive without index cards?
• NO!
• Why?
• Digital data need a technical system to keep them alive. If the technical
system disappears, the digital data are gone as well. A world without
electricity for, say, 20 years would lead to heavy data loss.
• Digital data cannot be read by human beings; we need a device to make
them visible to us.
Some other differences
• Is a machine able to read a book?
• NO!
• Is a machine able to read a digital book?
• YES and NO (only first attempts)
• YES, it is able to automatically process the document, to extract
the content, to index it, to print it out, to publish it on the Internet,
etc.
• NO, not in the sense of a human being who understands the
content of a book, although the machine can already recognise a lot
of the content (e.g. person names, institutions, geographical names,
etc.)
Observations
•Metadata are data about data.
•This game can be played several times.
•Metadata are structuring unstructured information.
•Metadata are helpful if they are kept close to the data.
•Metadata are especially important for digital data since
digital data are invisible to human beings.
•Digital metadata can be understood by machines, analogue
metadata cannot.
Consequences
•Good metadata must
• record data that are helpful for the two main tasks of libraries:
preservation and access
• structure data in a meaningful way
• be readable to human beings
• be readable to machines
• be acknowledged and maintained by the community
• be available in explicit form with explanations, examples and
guidelines
Metadata in a (newspaper) digitisation project
•Analogue material
•Digitisation process
•Text recognition process
•Structural enhancement
•File naming and structuring
•Ownership and Intellectual Property rights
•Digital provenance
•Intellectual substance
Analogue newspaper
•“Typical metadata”
• Start and (maybe) end date of a newspaper
• Place of publication
• Titles and variations
• Publishers and editors
• Frequency of publication
• Language
• Material aspects, such as size of the paper
•Does a digitisation project need to recapture all this
information?
•The library catalogue is the authoritative source for this
information; a link (identifier) is the obvious solution
Analogue newspaper
•Most of these data are within library catalogues, but a lot is
also missing:
• E.g.: very rarely a complete directory of all newspaper issues is
available, or for all special editions (e.g. for historical events, etc.),
supplements, etc.
• Missing issues or missing pages are also very rarely recorded
• Newspaper digitisation project managers regularly find that their
work produces the first complete edition (gaps included)
•The natural structure is therefore the issue, and this is one of
the most important metadata elements within EU Newspaper as well
• We expect all files structured according to issues/days (not to
volumes or years)
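Once files are organised by issue/day, gaps can be listed rather than discovered by accident. A minimal sketch, assuming a title that appeared every calendar day (real newspapers often skip Sundays or holidays, so the expected calendar would need adjusting); the dates are invented for illustration.

```python
from datetime import date, timedelta


def missing_issue_dates(found: set[date], first: date, last: date) -> list[date]:
    """List the days between first and last (inclusive) with no issue on file."""
    days = (last - first).days + 1
    # Assumption: one issue is expected for every calendar day in the range.
    expected = (first + timedelta(days=n) for n in range(days))
    return [day for day in expected if day not in found]


# Issue dates as they might be read from the issue/day directory names.
found = {date(1913, 5, 1), date(1913, 5, 2), date(1913, 5, 5)}
gaps = missing_issue_dates(found, date(1913, 5, 1), date(1913, 5, 5))
print([day.isoformat() for day in gaps])
```

Whether a listed date is a lost issue or a day without publication still has to be decided by someone who knows the title; the list only shows where to look.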
Digitisation process
•A lot of artefacts: the image captured by the scanning system is
quite different from the image finally stored, owing to internal
processes, enhancement software, deskewing, cropping, etc.
•Which metadata need to be kept?
•Type of scanner used?
• For a whole run, for single pages?
• Cameras now often replace scanners: they typically provide a lot
of metadata (EXIF), but the resolution is a problem (the distance
must be known!)
•Microfilm
• Cameras!
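The resolution problem with cameras can be illustrated with a small calculation. The sketch below does not use the camera distance; as an alternative, it assumes that an object of known size (for example a ruler) is visible in the frame, and the numbers are invented for illustration.

```python
MM_PER_INCH = 25.4


def effective_ppi(pixels: float, physical_mm: float) -> float:
    """Pixels per inch, from a length measured both in pixels and in millimetres."""
    if physical_mm <= 0:
        raise ValueError("physical length must be positive")
    return pixels / (physical_mm / MM_PER_INCH)


# A 100 mm stretch of the ruler covers 1181 pixels in the captured image.
ppi = effective_ppi(pixels=1181, physical_mm=100)
print(round(ppi))
```

A value obtained this way is worth recording per capture session, since it changes whenever the camera is moved.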
Optical Character Recognition (OCR)
•Has a long history, but many libraries were (and often still are)
sceptical
•OCR data are produced automatically and will contain errors
•The error rate will depend on the type of printing, the age of the
newspaper, the way it has been scanned, etc. but also on
the software used, the version and the parameters of the
software
•OCR engines provide not only text, but also information on
the layout of a page: e.g. coordinates of words
•OCR data may be corrected in parts, e.g. title of an article,
but not the body of the full-text
•Crowdsourcing may also play a role
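One common way to put a number on the OCR error rate is the character error rate: the edit distance between a manually checked reference text and the OCR output, divided by the length of the reference. The sketch below is a plain illustration of that measure, not the evaluation method of any particular project, and the sample strings are invented.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


reference = "Zeitung"   # manually checked text
ocr_output = "Zeitnng"  # what the engine returned: "u" misread as "n"
cer = character_error_rate(reference, ocr_output)
print(f"{cer:.3f}")
```

Stored alongside the software name, version and parameters, a figure like this makes later OCR updates comparable.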
Structural enhancement
• The physical units of a newspaper are the issue and the single
page.
• But for the reader the natural unit is the article, i.e. a piece of
content.
• Articles may consist of titles, subtitles, leads, photos connected
with them, caption lines, etc.
• But apart from articles we will also find announcements,
advertisements, charts, weather reports, tables of stock quotations,
etc.
• Structural enhancement may be done completely automatically
(like OCR), but in most cases it is a manual (outsourced) process
• It is time-consuming and expensive, so it is especially important to
know what has been done and with which accuracy
Intellectual property rights
•Several groups/persons have IP rights on newspapers
•The publisher/editor
• Responsible for the whole newspaper – usually distribution rights
stay with the publisher
•Journalists as permanent staff of the newspaper
• Copyright will stay with the journalists but access rights usually
belong to the newspaper publisher
•Freelancers
• Especially photographers; IP rights stay with them
IP Rights
• Problem of “orphan works”
• Digitising a newspaper and making it available is a new kind of usage
that is not covered by “old” contracts
• This IP right therefore stays in principle with the copyright owners
• But it is impossible to trace all IP right holders of an old newspaper
• EU Directive on Orphan Works
• Puts libraries in a privileged position: they will be allowed to digitise orphan
works under special conditions
• Required to make a diligent search
• To document this search and to register the item in question
• To remunerate right holders who come forward later
• New kinds of metadata are needed as well
Digital provenance
•What happens to the digital data as time goes by?
•E.g. migration from one image format to another?
•Or update of OCR data?
•Or further structural enhancement with the support of users
(=crowd)?
•Data need to stay coherent and changes should be
transparent
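Keeping changes transparent can be as simple as an append-only list of events per digital object. The sketch below is an illustration only: the class names, field names, event types and file identifiers are hypothetical and do not follow any specific standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ProvenanceEvent:
    event_type: str   # e.g. "format migration", "OCR update"
    target: str       # identifier of the file or object affected
    detail: str       # what was done, with which tool and settings
    timestamp: datetime


@dataclass
class ProvenanceLog:
    events: list[ProvenanceEvent] = field(default_factory=list)

    def record(self, event_type: str, target: str, detail: str,
               when: Optional[datetime] = None) -> ProvenanceEvent:
        """Append an event; existing events are never edited or removed."""
        event = ProvenanceEvent(event_type, target, detail,
                                when or datetime.now(timezone.utc))
        self.events.append(event)
        return event

    def history(self, target: str) -> list[ProvenanceEvent]:
        """All events for one object, in the order they were recorded."""
        return [event for event in self.events if event.target == target]


log = ProvenanceLog()
log.record("format migration", "page_0001", "TIFF converted to JPEG 2000")
log.record("OCR update", "page_0002", "reprocessed with a newer engine version")
log.record("crowd correction", "page_0001", "article title corrected by a user")
print([event.event_type for event in log.history("page_0001")])
```

Because events are only ever added, the state of an object at any point can be explained from its history.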
Intellectual substance
•“The Times”, “Le Monde”, “Frankfurter Allgemeine Zeitung”,
“Washington Post”, etc.
•Publishers, well-known journalists, famous articles and
headlines, history of journalism, etc.
•Political attitude
•Layout, structuring and general appearance
•Objective must be: A holistic understanding of a newspaper
taking into account all aspects which we have mentioned
before.
•Datamining technology
Main challenge: Putting it all together
[Diagram: intellectual substance, digital provenance, analogue newspaper, enhancement, IP rights, digitisation process, file naming, OCR and enhancement, “???”]
EU Newspaper ENMAP
•Europeana Newspaper METS ALTO Profile
•Objective
• Provide a robust metadata model for the digitisation of
newspapers that can be used by libraries for preservation, access
and interoperability (delivery of data to Europeana)
•Roadmap
• Set up an internal format by M12
• Implement it within the project and deliver information packages to
Europeana according to this format
• Extend the internal format and make it more general so that it can
easily be used outside the project as well (by M18/July 2013)
ENMAP – Main approach
•METS (Metadata Encoding and Transmission Standard)
• Library of Congress
• Open format: Editorial board for maintaining the format
• Container format: Provides a frame for all kinds of metadata
• Goes back to Making of America II (a late-1990s digitisation project)
• XML format (readable by machines as well as human beings,
with a simple text editor)
• THE dominant format in the library world for digitisation projects
• Within OAIS (Open Archival Information System) it serves as an
Information Package (AIP, SIP, DIP)
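The "container" idea becomes clearer when the main METS sections are laid out. The following sketch builds an empty frame with Python's standard library; the object identifier is invented, and the result is not schema-valid as it stands (for example, `dmdSec` and `amdSec` need `ID` attributes and content).

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def mets_skeleton(object_id: str) -> ET.Element:
    """Create a METS root element with its main sections, all still empty."""
    root = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": object_id})
    # Header, descriptive metadata, administrative metadata (technical,
    # rights, provenance), file inventory and structural map.
    for section in ("metsHdr", "dmdSec", "amdSec", "fileSec", "structMap"):
        ET.SubElement(root, f"{{{METS_NS}}}{section}")
    return root


mets = mets_skeleton("NP0001_1913-05-03")  # hypothetical issue identifier
sections = [child.tag.split("}")[1] for child in mets]
print(ET.tostring(mets, encoding="unicode"))
```

Each section is a slot into which other metadata formats are placed or from which external files are referenced, which is why profiles are needed to say what goes where.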
METS
•Cons of METS
• The standardisation level is rather low: Profiles are needed to
specify the actual usage of the format
• It introduces an extra complexity to digitisation projects
• The role of METS within the digital preservation process is not
always clear: Is it just for delivery of data (e.g. Submission
Information Package) or for “real” preservation, e.g. as Archival
Information Package (AIP)?
•My personal opinion
• A rich METS file as AIP together with all content data (=images,
OCR files, etc.) on a storage server are good prerequisites for
digital preservation
ENMAP
•Usage of METS within EU Newspaper
• We use METS as a submission format (SIP) for delivering data from
the libraries and the enhancement processes (OCR, structural
enhancement) to Europeana
• But we will also provide, as a concept, a format that may serve as an
Archival Information Package
•Descriptive metadata
• Are kept in MODS (Metadata Object Description Schema)
• Dublin Core would have been an alternative, but MODS is richer
and there is a relationship to MARC21
•Technical data
• MIX (Metadata for Images in XML)
• Extended format
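A descriptive record in MODS for a newspaper title might look roughly like the output of the sketch below. It uses only a handful of MODS elements, the values (title, place, catalogue identifier) are invented, and it says nothing about which elements ENMAP actually requires.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(name: str) -> str:
    """Qualify an element name with the MODS namespace."""
    return f"{{{MODS_NS}}}{name}"


def mods_record(title: str, place: str, language_code: str,
                catalogue_id: str) -> ET.Element:
    """Build a small MODS record for a newspaper title."""
    mods = ET.Element(q("mods"))
    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = title
    origin = ET.SubElement(mods, q("originInfo"))
    place_el = ET.SubElement(origin, q("place"))
    ET.SubElement(place_el, q("placeTerm"), {"type": "text"}).text = place
    language = ET.SubElement(mods, q("language"))
    ET.SubElement(language, q("languageTerm"),
                  {"type": "code", "authority": "iso639-2b"}).text = language_code
    # The identifier is the link back to the authoritative catalogue record.
    ET.SubElement(mods, q("identifier"), {"type": "local"}).text = catalogue_id
    return mods


record = mods_record("Beispiel-Zeitung", "Innsbruck", "ger", "CAT-000123")
print(ET.tostring(record, encoding="unicode"))
```

The record stays deliberately thin: everything else about the title can be looked up in the catalogue through the identifier.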
ENMAP
• OCR Data
• ALTO (Analyzed Layout and Text Object)
• ALTO files are kept “outside” METS as XML files (one file per image) but are linked from it
• ALTO makes it possible to store not only text, but also information from the OCR
engine, such as coordinates of blocks, lines, words as well as type of
blocks, e.g. text or pictures
• Connection between image and text is important for e.g. producing PDFs,
or highlighting search results on an image or for further enhancement
• One of the main achievements of the IMPACT project was to convince
industry to provide native ALTO export, e.g. ABBYY FineReader
• ALTO is therefore rather simple to produce, and the effect on
standardisation should be rather high
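The link between text and image is what makes search-result highlighting possible. The sketch below reads word coordinates from a heavily simplified ALTO fragment; the sample content is invented, and the namespace shown is the ALTO version 2 one, so other ALTO versions would need a different namespace URI.

```python
import xml.etree.ElementTree as ET

ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Simplified sample: a real ALTO file carries more attributes and sections.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Ankara" HPOS="120" VPOS="200" WIDTH="310" HEIGHT="60"/>
            <SP/>
            <String CONTENT="Workshop" HPOS="450" VPOS="200" WIDTH="420" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def find_word_boxes(alto_xml: str, query: str) -> list[tuple[int, ...]]:
    """Return (HPOS, VPOS, WIDTH, HEIGHT) for every word matching the query."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for word in root.iterfind(".//alto:String", ALTO_NS):
        if word.get("CONTENT", "").lower() == query.lower():
            # Coordinates may be written as decimals, so go through float first.
            boxes.append(tuple(int(float(word.get(key)))
                               for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")))
    return boxes


print(find_word_boxes(SAMPLE_ALTO, "workshop"))
```

A viewer can draw these rectangles over the page image to show the reader where the search term occurs.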
ENMAP
•Structural enhancement
• METS provides a “structural map” that makes it possible to manage single
articles which may come from an enhancement process
• E.g. titles of articles, reading order of sections and pictures can be
recorded within METS and then linked to the image via the ALTO
files
• In this way it is possible to index or to display a single article
•Our ambition
• To contribute to the standardisation of structural data by providing
a data dictionary for structural enhancement.
• We believe that a clear structuring will support especially full-text
searching and further text mining
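How an article can be recorded in the structural map and tied to ALTO blocks is sketched below. The METS elements and attributes used (`structMap`, `div`, `fptr`, `area` with `FILEID`, `BETYPE`, `BEGIN`) exist in METS, but the `TYPE` values, the file identifier and the block identifiers are invented here and are not taken from the ENMAP profile.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(name: str) -> str:
    """Qualify an element name with the METS namespace."""
    return f"{{{METS_NS}}}{name}"


def article_div(parent: ET.Element, label: str, alto_file_id: str,
                block_ids: list[str]) -> ET.Element:
    """Add one article whose parts point at ALTO blocks in reading order."""
    div = ET.SubElement(parent, q("div"), {"TYPE": "ARTICLE", "LABEL": label})
    for order, block_id in enumerate(block_ids, start=1):
        part = ET.SubElement(div, q("div"),
                             {"TYPE": "TEXTBLOCK", "ORDER": str(order)})
        fptr = ET.SubElement(part, q("fptr"))
        # BEGIN holds the ID of an element inside the referenced ALTO file.
        ET.SubElement(fptr, q("area"),
                      {"FILEID": alto_file_id, "BETYPE": "IDREF", "BEGIN": block_id})
    return div


struct_map = ET.Element(q("structMap"), {"TYPE": "LOGICAL"})
issue = ET.SubElement(struct_map, q("div"), {"TYPE": "ISSUE"})
article = article_div(issue, "Workshop in Ankara", "ALTO0001", ["TB1", "TB4", "TB5"])

reading_order = [(part.get("ORDER"), part.find(q("fptr")).find(q("area")).get("BEGIN"))
                 for part in article]
print(reading_order)
```

Since the ALTO blocks carry coordinates, following these pointers is enough to index or display the article on its own.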
ENMAP
•IP Rights
• Currently there are no standards available
• Within ENMAP we state only the ownership of the library on the
digital files – but this must not be mixed up with the actual IP rights
•Digital provenance
• PREMIS (Preservation Metadata: Implementation Strategies)
• An attempt to provide a general framework for “events” within the
life cycle of digital objects
•Intellectual content
• Tagging of Named Entities is a first step into data mining
• No standards are currently available
Keeping the format alive!
•Is there a chance that libraries outside/after the EU
Newspaper project will use and take up ENMAP?
•We believe “yes”!
Uptake of the format
•Reasons
• 10 million pages of newspapers will be enhanced within EU
Newspaper, which means several hundred thousand ENMAP
packages
• 12 libraries from all over Europe will receive data in ENMAP
• Europeana will naturally use this format for further integration of
newspaper information
• Software tools are available to support the process of generating
metadata as well as validating and delivering the data
• A workflow is available for putting everything together and
producing ENMAP packages
• Documentation and examples will be available
Contribute
•Further steps
• You are invited to review ENMAP once it is out (during 2013)
• You will find a public version on the website of EU Newspaper:
•http://www.europeana-newspapers.eu/
•Thank you for your attention!
Performance Evaluation and Quality AssessmentPerformance Evaluation and Quality Assessment
Performance Evaluation and Quality Assessment
 
Keynote: Stefano Bertolo
Keynote: Stefano BertoloKeynote: Stefano Bertolo
Keynote: Stefano Bertolo
 
EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...
EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...
EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...
 

Plus de Europeana Newspapers

Presentation of Philippe Mezzasalma at the BnF Information Day in Paris
Presentation of Philippe Mezzasalma at the BnF Information Day in ParisPresentation of Philippe Mezzasalma at the BnF Information Day in Paris
Presentation of Philippe Mezzasalma at the BnF Information Day in ParisEuropeana Newspapers
 
Presentation of Ioannis Anagnostopoulos at BnF Information Day
Presentation of Ioannis Anagnostopoulos at BnF Information DayPresentation of Ioannis Anagnostopoulos at BnF Information Day
Presentation of Ioannis Anagnostopoulos at BnF Information DayEuropeana Newspapers
 
Présentation Günter Mühlberger, BnF Information Day
Présentation Günter Mühlberger, BnF Information DayPrésentation Günter Mühlberger, BnF Information Day
Présentation Günter Mühlberger, BnF Information DayEuropeana Newspapers
 
Presentation of Claus Gravenhorst, BnF Information Day
Presentation of Claus Gravenhorst, BnF Information DayPresentation of Claus Gravenhorst, BnF Information Day
Presentation of Claus Gravenhorst, BnF Information DayEuropeana Newspapers
 
Presentation of Alaa Abi Haidar at the BnF Information Day
Presentation of Alaa Abi Haidar at the BnF Information DayPresentation of Alaa Abi Haidar at the BnF Information Day
Presentation of Alaa Abi Haidar at the BnF Information DayEuropeana Newspapers
 
Europeana Newspapers Estonian Infoday Ragne Kouts
Europeana Newspapers Estonian Infoday Ragne KoutsEuropeana Newspapers Estonian Infoday Ragne Kouts
Europeana Newspapers Estonian Infoday Ragne KoutsEuropeana Newspapers
 
Europeana Newspapers Estonian Infoday Kristel Veimann
Europeana Newspapers Estonian Infoday Kristel VeimannEuropeana Newspapers Estonian Infoday Kristel Veimann
Europeana Newspapers Estonian Infoday Kristel VeimannEuropeana Newspapers
 
Europeana Newspapers Estonian Infoday Krista Aru
Europeana Newspapers Estonian Infoday Krista AruEuropeana Newspapers Estonian Infoday Krista Aru
Europeana Newspapers Estonian Infoday Krista AruEuropeana Newspapers
 
Europeana Newspapers Estonian Infoday Fred Puss
Europeana Newspapers Estonian Infoday Fred PussEuropeana Newspapers Estonian Infoday Fred Puss
Europeana Newspapers Estonian Infoday Fred PussEuropeana Newspapers
 
Europeana Newpapers LFT Infoday Neudecker
Europeana Newpapers LFT Infoday NeudeckerEuropeana Newpapers LFT Infoday Neudecker
Europeana Newpapers LFT Infoday NeudeckerEuropeana Newspapers
 
Europeana Newspapers LFT Infoday Thompson
Europeana Newspapers LFT Infoday ThompsonEuropeana Newspapers LFT Infoday Thompson
Europeana Newspapers LFT Infoday ThompsonEuropeana Newspapers
 
Europeana Newspapers LFT Infoday Rossi
Europeana Newspapers LFT Infoday RossiEuropeana Newspapers LFT Infoday Rossi
Europeana Newspapers LFT Infoday RossiEuropeana Newspapers
 
Europeana Newspapers LFT Infoday Messina
Europeana Newspapers LFT Infoday MessinaEuropeana Newspapers LFT Infoday Messina
Europeana Newspapers LFT Infoday MessinaEuropeana Newspapers
 
Europeana Newspapers Infoday Marchetti
Europeana Newspapers Infoday MarchettiEuropeana Newspapers Infoday Marchetti
Europeana Newspapers Infoday MarchettiEuropeana Newspapers
 
Europeana Newspapers LFT Infoday Kempf
Europeana Newspapers LFT Infoday KempfEuropeana Newspapers LFT Infoday Kempf
Europeana Newspapers LFT Infoday KempfEuropeana Newspapers
 
Europeana Newspapers LFT Infoday Genereux
Europeana Newspapers LFT Infoday GenereuxEuropeana Newspapers LFT Infoday Genereux
Europeana Newspapers LFT Infoday GenereuxEuropeana Newspapers
 
Europeana Newspapers LFT Infoday Bolioli
Europeana Newspapers LFT Infoday BolioliEuropeana Newspapers LFT Infoday Bolioli
Europeana Newspapers LFT Infoday BolioliEuropeana Newspapers
 

Plus de Europeana Newspapers (20)

Presentation of Philippe Mezzasalma at the BnF Information Day in Paris
Presentation of Philippe Mezzasalma at the BnF Information Day in ParisPresentation of Philippe Mezzasalma at the BnF Information Day in Paris
Presentation of Philippe Mezzasalma at the BnF Information Day in Paris
 
Presentation of Ioannis Anagnostopoulos at BnF Information Day
Presentation of Ioannis Anagnostopoulos at BnF Information DayPresentation of Ioannis Anagnostopoulos at BnF Information Day
Presentation of Ioannis Anagnostopoulos at BnF Information Day
 
Présentation Günter Mühlberger, BnF Information Day
Présentation Günter Mühlberger, BnF Information DayPrésentation Günter Mühlberger, BnF Information Day
Présentation Günter Mühlberger, BnF Information Day
 
Presentation of Claus Gravenhorst, BnF Information Day
Presentation of Claus Gravenhorst, BnF Information DayPresentation of Claus Gravenhorst, BnF Information Day
Presentation of Claus Gravenhorst, BnF Information Day
 
Presentation of Alaa Abi Haidar at the BnF Information Day
Presentation of Alaa Abi Haidar at the BnF Information DayPresentation of Alaa Abi Haidar at the BnF Information Day
Presentation of Alaa Abi Haidar at the BnF Information Day
 
Europeana Newspapers Estonian Infoday Ragne Kouts
Europeana Newspapers Estonian Infoday Ragne KoutsEuropeana Newspapers Estonian Infoday Ragne Kouts
Europeana Newspapers Estonian Infoday Ragne Kouts
 
Europeana Newspapers Estonian Infoday Kristel Veimann
Europeana Newspapers Estonian Infoday Kristel VeimannEuropeana Newspapers Estonian Infoday Kristel Veimann
Europeana Newspapers Estonian Infoday Kristel Veimann
 
Europeana Newspapers Estonian Infoday Krista Aru
Europeana Newspapers Estonian Infoday Krista AruEuropeana Newspapers Estonian Infoday Krista Aru
Europeana Newspapers Estonian Infoday Krista Aru
 
Europeana Newspapers Estonian Infoday Fred Puss
Europeana Newspapers Estonian Infoday Fred PussEuropeana Newspapers Estonian Infoday Fred Puss
Europeana Newspapers Estonian Infoday Fred Puss
 
Europeana Newpapers LFT Infoday Neudecker
Europeana Newpapers LFT Infoday NeudeckerEuropeana Newpapers LFT Infoday Neudecker
Europeana Newpapers LFT Infoday Neudecker
 
Europeana Newspapers LFT Infoday Thompson
Europeana Newspapers LFT Infoday ThompsonEuropeana Newspapers LFT Infoday Thompson
Europeana Newspapers LFT Infoday Thompson
 
Europeana Newspapers LFT Infoday Rossi
Europeana Newspapers LFT Infoday RossiEuropeana Newspapers LFT Infoday Rossi
Europeana Newspapers LFT Infoday Rossi
 
Enp lft infoday_neudecker
Enp lft infoday_neudeckerEnp lft infoday_neudecker
Enp lft infoday_neudecker
 
Europeana Newspapers LFT Infoday Messina
Europeana Newspapers LFT Infoday MessinaEuropeana Newspapers LFT Infoday Messina
Europeana Newspapers LFT Infoday Messina
 
Europeana Newspapers Infoday Marchetti
Europeana Newspapers Infoday MarchettiEuropeana Newspapers Infoday Marchetti
Europeana Newspapers Infoday Marchetti
 
Europeana Newspapers LFT Infoday Kempf
Europeana Newspapers LFT Infoday KempfEuropeana Newspapers LFT Infoday Kempf
Europeana Newspapers LFT Infoday Kempf
 
Europeana Newspapers LFT Infoday Genereux
Europeana Newspapers LFT Infoday GenereuxEuropeana Newspapers LFT Infoday Genereux
Europeana Newspapers LFT Infoday Genereux
 
Europeana Newspapers LFT Infoday Bolioli
Europeana Newspapers LFT Infoday BolioliEuropeana Newspapers LFT Infoday Bolioli
Europeana Newspapers LFT Infoday Bolioli
 
ENP_Dutch_Infoday_MWillems
ENP_Dutch_Infoday_MWillemsENP_Dutch_Infoday_MWillems
ENP_Dutch_Infoday_MWillems
 
ENP_Dutch_Infoday_PHuijnen
ENP_Dutch_Infoday_PHuijnen ENP_Dutch_Infoday_PHuijnen
ENP_Dutch_Infoday_PHuijnen
 

Metadata

  • 1. Europeana Newspapers - Metadata. Ankara, 3rd May 2013. Günter Mühlberger, Innsbruck University
  • 2. Agenda
      • Introduction
      • General considerations on metadata
      • Metadata and newspaper digitisation
      • EU Newspaper Project - Profile
  • 3. Introduction
      • Innsbruck University
      • Digitisation and Digital Preservation Group
          • Involved in digital library projects since 1995
          • Coordinated several EU R&D projects, currently 8 FTEs
          • Introduced ALTO (Analyzed Layout and Text Object) to the library community in 2002
          • Fostered Optical Character Recognition for blackletter fonts (Gothic, Fraktur) in 2004
          • Initiated and coordinated the E-Books on Demand Network (EOD)
          • Member of the Executive Board of the IMPACT Project 2008-2012 (large-scale project for mass digitisation and text recognition)
          • Development of a rule-based document understanding platform
  • 4. Introduction
      • Digitisation and full-text recognition
          • Several projects since 1995, all with OCR processing
          • Newspaper clippings (650,000 clippings)
          • Index cards from libraries (31 catalogues, several million cards)
          • German dissertations (215,000 dissertations, 24 million pages)
      • Currently
          • Three EU projects, among them the FP7 tranScriptorium project (Handwritten Text Recognition), as a partner
          • OCR processing of 8 million newspaper pages for EU Newspaper
          • Digitisation of the regional newspaper from Tyrol/Austria
  • 5. General considerations
      • Definition
          • “Data about data”
      • Example 1
          • Data: I am taking part in this event, which is a workshop of the EU Newspaper project in Ankara. The event is currently going on.
          • Metadata: On 3rd May, Günter Mühlberger took part in the EU Newspaper Workshop in Ankara.
      • Example 2
          • Data: We are digitising a newspaper. We cut the binding, use a document scanner and produce digital image files.
          • Metadata: A Kodak i620 scanner with 24-bit colour and an automatic document feeder provides JPEG files with low compression (90% of the information from the original file is kept).
  • 6. General considerations
      • Example 3
          • Data: A service provider delivers the scanned images of a newspaper to the library on a hard disc.
          • Metadata: A Windows file system with a root directory named after the identity number of the newspaper, subdirectories for years and further subdirectories at issue/day level. An XML file with metadata on the files is expected at root level.
      • Observations
          • With metadata we are introducing a “new view” on data
          • Metadata are more like a summary or a table of contents of the data than “new” data
          • Often they are implicit and people will say “This is clear to us anyway”
          • Metadata need some conventions and agreements
          • We can produce metadata on metadata
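The delivery convention in Example 3 can be checked mechanically. The following is a minimal sketch in Python, not the project's actual validation tool: the manifest name `metadata.xml`, the `YYYY` year folders and the `YYYY-MM-DD` issue folders are assumptions made for illustration, since the slide only states that such levels exist.

```python
from pathlib import Path
import re
import tempfile

YEAR = re.compile(r"^\d{4}$")                 # assumed naming of year folders
ISSUE = re.compile(r"^\d{4}-\d{2}-\d{2}$")    # assumed naming of issue/day folders


def check_delivery(root: Path, manifest_name: str = "metadata.xml") -> list[str]:
    """Return human-readable problems found in a delivery folder.

    `root` is the directory named after the newspaper identifier.
    `manifest_name` is a hypothetical file name for the root-level XML file.
    """
    problems = []
    if not (root / manifest_name).is_file():
        problems.append(f"missing {manifest_name} at root level")
    for year_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if not YEAR.match(year_dir.name):
            problems.append(f"unexpected directory: {year_dir.name}")
            continue
        for issue_dir in sorted(p for p in year_dir.iterdir() if p.is_dir()):
            # An issue folder must be a date that belongs to its year folder.
            ok = ISSUE.match(issue_dir.name) and issue_dir.name.startswith(year_dir.name)
            if not ok:
                problems.append(f"bad issue directory: {year_dir.name}/{issue_dir.name}")
    return problems


# Small demonstration on a throw-away directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "NEWSPAPER-0001"       # hypothetical newspaper identifier
    (root / "1913" / "1913-05-03").mkdir(parents=True)
    (root / "1913" / "oops").mkdir()
    problems = check_delivery(root)

print(problems)
```

Writing the convention down as a check like this is one way of turning implicit metadata ("this is clear to us anyway") into an explicit agreement between library and service provider.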
  • 7. Analogue vs. digital library world
      • Can books survive without libraries (or, more generally, without an organisation that takes care of them)?
      • Can books survive without index cards?
          • YES!
      • Why?
          • Most often they contain their own library card: the title page
          • They can be read and understood by human beings
          • Their physical condition is rather stable as long as they are stored in a dry environment and no disaster takes place (fire, water, ...)
  • 8. Analogue vs. digital
      • Can digital works survive without libraries (or organisations taking care of them)?
      • Can digital works survive without index cards?
          • NO!
      • Why?
          • Digital data need a technical system to keep them alive. If the technical system is gone, the digital data are gone as well. For example, a world without electricity for 20 years would lead to heavy data loss.
          • Digital data cannot be read by human beings: we need a device to make them visible to us.
  • 9. Some other differences
      • Is a machine able to read a book?
          • NO!
      • Is a machine able to read a digital book?
          • YES and NO (only first attempts)
          • YES, it is able to process the document automatically: to extract the content, index it, print it, publish it on the Internet, etc.
          • NO, not in the sense of a human being who understands the content of a book. But a machine can already recognise a lot of the content, e.g. person names, institutions, geographical names, etc.
  • 10. Observations
      • Metadata are data about data.
      • This game can be played several times.
      • Metadata structure unstructured information.
      • Metadata are most helpful when they are kept close to the data.
      • Metadata are especially important for digital data, since digital data are invisible to human beings.
      • Digital metadata can be understood by machines; analogue metadata cannot.
  • 11. Consequences
      • Good metadata must
          • record data that are helpful for the two main tasks of libraries: preservation and access
          • structure data in a meaningful way
          • be readable to human beings
          • be readable to machines
          • be acknowledged and maintained by the community
          • be available in explicit form with explanations, examples and guidelines
  • 12. Metadata in a (newspaper) digitisation project
      • Analogue material
      • Digitisation process
      • Text recognition process
      • Structural enhancement
      • File naming and structuring
      • Ownership and intellectual property rights
      • Digital provenance
      • Intellectual substance
  • 13. Analogue newspaper
      • “Typical metadata”
          • Start and (maybe) end date of a newspaper
          • Place of publication
          • Titles and variations
          • Publishers and editors
          • Frequency of publication
          • Language
          • Material aspects, such as size of the paper
      • Does a digitisation project need to recapture all this information?
      • The library catalogue is the authoritative source for this information; a link (identifier) is the obvious solution
  • 14. Analogue newspaper
      • Most of these data are in library catalogues, but a lot is also missing:
          • A complete directory of all issues, special editions (e.g. for historical events) and supplements is very rarely available
          • Missing issues or missing pages are also very rarely recorded
          • Managers of newspaper digitisation projects regularly find that the first complete edition (including its gaps) only becomes available through their work
      • The natural structure is therefore the issue, and this is one of the most important pieces of metadata within EU Newspaper as well
          • We expect all files to be structured by issue/day (not by volume or year)
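Once files are organised by issue/day, gaps in a run can be listed automatically. The sketch below is an illustration only and assumes that the publication frequency can be expressed as a set of weekdays (Monday = 0); real newspapers change frequency over time, so the result is a list of candidates to check, not a verdict.

```python
from datetime import date, timedelta


def missing_issue_dates(issue_dates, publication_weekdays):
    """List dates between the first and last known issue on which an issue
    was expected (by weekday) but none is present."""
    present = set(issue_dates)
    day, last = min(present), max(present)
    missing = []
    while day <= last:
        if day.weekday() in publication_weekdays and day not in present:
            missing.append(day)
        day += timedelta(days=1)
    return missing


# Hypothetical run of a paper that appears Monday to Saturday.
issues = [date(1913, 5, 1), date(1913, 5, 2), date(1913, 5, 5), date(1913, 5, 6)]
gaps = missing_issue_dates(issues, publication_weekdays={0, 1, 2, 3, 4, 5})
print(gaps)
```

In this example 3 May 1913 is a Saturday with no issue folder, so it is reported; 4 May is a Sunday and is not expected.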
  • 15. Digitisation process
      • A lot of artefacts: the image captured by the scanning system is quite different from the image finally stored (internal processes, enhancement software, deskewing, cropping, etc.)
      • Which metadata need to be kept?
      • Type of scanner used?
          • For a whole run, or for single pages?
          • Cameras now often replace scanners: they typically provide a lot of metadata (EXIF), but the resolution is a problem (the distance must be known!)
      • Microfilm
          • Cameras!
  • 16. Optical Character Recognition (OCR)
      • Has a long history, but many libraries were (and often still are) sceptical
      • OCR data are produced automatically and will contain errors
      • The error rate depends on the type of printing, the age of the newspaper and the way it was scanned, but also on the software used, its version and its parameters
      • OCR engines provide not only text but also information on the layout of a page, e.g. coordinates of words
      • OCR data may be corrected in part, e.g. the title of an article but not the body of the full text
      • Crowdsourcing may also play a role
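The slide does not say how the error rate is measured. One common measure, used here as an assumption, is the character error rate: the edit distance between the OCR output and a manually corrected reference, divided by the length of the reference. A minimal Python sketch:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def character_error_rate(ocr_text: str, reference: str) -> float:
    """Edit distance relative to the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


# Made-up example: one wrong character in a seven-character word.
cer = character_error_rate("Zeitnng", "Zeitung")
print(round(cer, 3))
```

Recording such a figure together with the engine, version and parameters is what makes OCR results from different runs comparable.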
  • 17. Structural enhancement
      • The physical units of a newspaper are the issue and the single page.
      • But for the reader the natural unit is the article, i.e. a piece of content.
      • Articles may consist of titles, subtitles, leads, photos connected with them, caption lines, etc.
      • Apart from articles we will also find announcements, advertisements, charts, weather reports, tables of stock quotations, etc.
      • Structural enhancement can be done completely automatically (like OCR), but in most cases it is a manual, outsourced process
      • It is time-consuming and expensive, so it is especially important to know what has been done and with what accuracy
  • 18. Intellectual property rights
      • Several groups/persons have IP rights on newspapers
      • The publisher/editor
          • Responsible for the whole newspaper; distribution rights usually stay with the publisher
      • Journalists as permanent staff of the newspaper
          • Copyright stays with the journalists, but access rights usually belong to the newspaper publisher
      • Freelancers
          • Especially photographers; IP rights stay with them
  • 19. IP rights
      • Problem of “orphan works”
          • Digitising a newspaper and making it available is a new kind of usage and is not covered by “old” contracts
          • This IP right therefore stays in principle with the copyright owners
          • But it is impossible to identify all IP rights holders of an old newspaper
      • EU Directive on Orphan Works
          • Puts libraries in a privileged position: they will be allowed to digitise orphan works under special conditions
          • Required to make a diligent search
          • To document this search and to register the item in question
          • To remunerate rights holders who come forward later
          • New kinds of metadata are needed as well
  • 20. Digital provenance
      • What happens to the digital data as time goes by?
      • E.g. migration from one image format to another?
      • Or an update of the OCR data?
      • Or further structural enhancement with the support of users (the crowd)?
      • Data need to stay coherent and changes should be transparent
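One way to keep such changes transparent is an append-only log of events, each tied to an object and to a checksum of the result. The sketch below is a hypothetical illustration in Python; the field names are loosely inspired by event-style preservation metadata and do not follow any particular schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib


@dataclass(frozen=True)
class ProvenanceEvent:
    """One change in the life cycle of a digital object (illustrative fields)."""
    event_type: str           # e.g. "migration", "ocr update"
    event_datetime: datetime
    object_id: str            # identifier of the affected file or page
    detail: str               # free-text description of what was done
    sha256_after: str         # checksum of the object after the change


def record_event(log, event_type, object_id, detail, new_bytes, when=None):
    """Append an event to `log` and return it; `when` defaults to now (UTC)."""
    event = ProvenanceEvent(
        event_type=event_type,
        event_datetime=when or datetime.now(timezone.utc),
        object_id=object_id,
        detail=detail,
        sha256_after=hashlib.sha256(new_bytes).hexdigest(),
    )
    log.append(event)
    return event


# Made-up history of a single page.
log = []
record_event(log, "migration", "page-0001", "image format migrated", b"new image bytes")
record_event(log, "ocr update", "page-0001", "re-run with newer engine version", b"<alto/>")
print([e.event_type for e in log])
```

Because the checksum is stored with each event, a later audit can verify that the stored file still matches the state the last event describes.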
  • 21. Intellectual substance
      • “The Times”, “Le Monde”, “Frankfurter Allgemeine Zeitung”, “Washington Post”, etc.
      • Publishers, well-known journalists, famous articles and headlines, history of journalism, etc.
      • Political attitude
      • Layout, structuring and general appearance
      • The objective must be a holistic understanding of a newspaper, taking into account all the aspects mentioned before
      • Data mining technology
  • 22. Main challenge: Putting it all together
      • [Diagram labels: Intellectual substance; Digital provenance; Analogue newspaper; Enhancement; IP rights; Digitisation process; File naming; OCR and enhancement; ???]
  • 23. EU Newspaper ENMAP
      • Europeana Newspaper METS ALTO Profile
      • Objective
          • Provide a robust metadata model for the digitisation of newspapers that can be used by libraries for preservation, access and interoperability (delivery of data to Europeana)
      • Roadmap
          • Set up an internal format by M12
          • Implement it within the project and deliver information packages to Europeana according to this format
          • Extend the internal format and make it more general so that it can easily be used outside the project as well (by M18/July 2013)
  • 24. ENMAP: Main approach
      • METS (Metadata Encoding and Transmission Standard)
          • Library of Congress
          • Open format: an editorial board maintains the format
          • Container format: provides a frame for all kinds of metadata
          • Goes back to Making of America II (a digitisation project of the late 1990s)
          • XML format (readable for machines as well as human beings, with a simple text editor)
          • THE dominant format in the library world for digitisation projects
          • Within OAIS (Open Archival Information System) it serves as an Information Package (AIP, SIP, DIP)
  • 25. METS
      • Cons of METS
          • The standardisation level is rather low: profiles are needed to specify the actual usage of the format
          • It introduces extra complexity to digitisation projects
          • The role of METS within the digital preservation process is not always clear: is it just for delivery of data (e.g. as Submission Information Package) or for “real” preservation, e.g. as Archival Information Package (AIP)?
      • My personal opinion
          • A rich METS file as AIP, together with all content data (images, OCR files, etc.) on a storage server, is a good prerequisite for digital preservation
  • 26. ENMAP
      • Usage of METS within EU Newspaper
          • We use METS as a submission format (SIP) for delivering data from the libraries and the enhancement processes (OCR, structural enhancement) to Europeana
          • But we will also provide, as a concept, a format that may serve as an Archival Information Package
      • Descriptive metadata
          • Are kept in MODS (Metadata Object Description Schema)
          • Dublin Core would have been an alternative, but MODS is richer and there is a relationship to MARC21
      • Technical data
          • MIX (NISO Metadata for Images in XML)
          • Extended format
  • 27. ENMAP
      • OCR data
          • ALTO (Analyzed Layout and Text Object)
          • Are kept “outside” METS in XML files (one file per image) but are linked
          • ALTO stores not only text but also information from the OCR engine, such as coordinates of blocks, lines and words, as well as the type of blocks, e.g. text or pictures
          • The connection between image and text is important, e.g. for producing PDFs, for highlighting search results on an image or for further enhancement
          • One of the main achievements of the IMPACT project was to convince industry to provide native ALTO export, e.g. in ABBYY FineReader
          • ALTO is therefore rather simple to produce, and the effect on standardisation should be rather high
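The link between text and image can be illustrated with a few lines of Python. This is a sketch under assumptions: the sample document is made up, and the namespace shown is the ALTO version 2 namespace; files produced with another ALTO version use a different namespace URI and the constant would need adjusting.

```python
import xml.etree.ElementTree as ET

# Assumed namespace (ALTO v2); adjust for files written against another version.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Made-up, heavily shortened ALTO file for one page image.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="TB1">
      <TextLine>
        <String CONTENT="Innsbrucker" HPOS="100" VPOS="200" WIDTH="310" HEIGHT="40"/>
        <String CONTENT="Nachrichten" HPOS="420" VPOS="200" WIDTH="300" HEIGHT="40"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def find_word_boxes(alto_xml: str, query: str):
    """Return (x, y, width, height) for every word equal to `query`,
    ignoring case. These boxes can be drawn over the page image."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for word in root.findall(".//alto:String", ALTO_NS):
        if word.get("CONTENT", "").lower() == query.lower():
            boxes.append(tuple(int(float(word.get(key)))
                               for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")))
    return boxes


hits = find_word_boxes(SAMPLE_ALTO, "nachrichten")
print(hits)
```

The returned rectangles are what a viewer needs in order to highlight a search hit on the scanned page.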
  • 28. ENMAP
      • Structural enhancement
          • METS provides a “structural map” that makes it possible to manage single articles, which may come from an enhancement process
          • E.g. titles of articles and the reading order of sections and pictures can be recorded within METS and then linked to the image via the ALTO files
          • In this way it is possible to index or to display a single article
      • Our ambition
          • To contribute to the standardisation of structural data by providing a data dictionary for structural enhancement
          • We believe that a clear structuring will especially support full-text searching and further text mining
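As a rough sketch of what such a structural map can look like, the Python snippet below builds a small METS `structMap` in which an article points to text blocks in ALTO files. The METS element and attribute names (`structMap`, `div`, `fptr`, `area`, `FILEID`, `BEGIN`, `BETYPE`) come from the METS schema; the `TYPE` values, labels and identifiers are hypothetical and are not taken from the ENMAP profile.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS)


def add_article(parent, label, order, blocks):
    """Add one article <div> under `parent`.

    `blocks` is a list of (alto_file_id, block_id) pairs in reading order;
    each pair becomes an <area> pointing at an ID inside an ALTO file.
    """
    div = ET.SubElement(parent, f"{{{METS}}}div",
                        TYPE="ARTICLE", LABEL=label, ORDER=str(order))
    for file_id, block_id in blocks:
        fptr = ET.SubElement(div, f"{{{METS}}}fptr")
        ET.SubElement(fptr, f"{{{METS}}}area",
                      FILEID=file_id, BEGIN=block_id, BETYPE="IDREF")
    return div


# Hypothetical issue with one article that continues on a second page.
struct_map = ET.Element(f"{{{METS}}}structMap", TYPE="LOGICAL")
issue = ET.SubElement(struct_map, f"{{{METS}}}div", TYPE="ISSUE", LABEL="1913-05-03")
add_article(issue, "Example headline", 1, [("ALTO0001", "TB1"), ("ALTO0002", "TB4")])

xml_text = ET.tostring(struct_map, encoding="unicode")
print(xml_text)
```

Because each `area` names both a file and a block ID, an indexer can collect exactly the text of one article even when it runs across several pages.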
  • 29. ENMAP
      • IP rights
          • Currently there are no standards available
          • Within ENMAP we state only the library's ownership of the digital files, but this must not be confused with the actual IP rights
      • Digital provenance
          • PREMIS (Preservation Metadata: Implementation Strategies)
          • An attempt to provide a general framework for “events” within the life cycle of digital objects
      • Intellectual content
          • Tagging of named entities is a first step towards data mining
          • No standards are currently available
  • 30. Bring the format to life!
      • Is there a chance that libraries outside or after the EU Newspaper project will take up and use ENMAP?
      • We believe “yes”!
  • 31. Uptake of the format
      • Reasons
          • 10 million newspaper pages will be enhanced within EU Newspaper, which means several hundred thousand ENMAP packages
          • 12 libraries from all over Europe will receive data in ENMAP
          • Europeana will naturally use this format for further integration of newspaper information
          • Software tools are available to support generating the metadata as well as validating and delivering the data
          • A workflow is available for putting everything together and producing ENMAP packages
          • Documentation and examples will be available
  • 32. Contribute
      • Further steps
          • You are invited to review ENMAP once it is out (during 2013)
          • You will find a public version on the website of EU Newspaper: http://www.europeana-newspapers.eu/
      • Thank you for your attention!