Word Occurrence Based Extraction
of Work Contributors from
Statements of Responsibility

Nuno Freire
The European Library

TPDL-2013
Valletta, September 2013
Overview
Statements of responsibility from library bibliographic data:
“French Canadian freely arranged by Katherine K. Davis”.
“ed. by Peter Noever ; with a forew. by Frank O. Gehry; and contrib. by
Coop Himmelblau.”
“W. Lange, A.C. Zeven and N.G. Hogenboom, editors”
“Vicente Aleixandre ; estudio previo, selección y notas de Leopoldo de
Luis”

Extracting work contributors for use in a rights
infrastructure: ARROW
http://arrow-net.eu
Outline
 The context
• The ARROW rights infrastructure
• The use of national bibliographies in ARROW
 The problem
 The approach
 Evaluation
 Conclusion and future work
The ARROW rights
infrastructure
 ARROW aims to support mass digitisation projects
with automated ways to clear the rights of the books to
be digitised.
 To identify and clear the rights associated with a book
a complex process needs to be undertaken:
• Determine the work(s) contained within the book
• Identify all the other expressions of the same work(s)
• Identify the publisher(s) and contributor(s) involved
• Determine the dates of publication at work level
• Determine whether the work(s), and not the book itself, are
still in commerce
• If necessary, obtain any licenses from the rights holders or
collective rights organizations
What is ARROW
 A rights infrastructure and system for the
identification of:
• Rights status
• In or out of copyright
• In or out of print / commercialised or not
• Rights
• Which rights are involved
• Right holders
• Authors
• Publishers
• How and where to clear the rights
• Orphan Works and their registration
Sources of Information in ARROW
 ARROW makes information available
from several sources:
• The European Library:
• National bibliographies - to identify the book and to
cluster it with all other books containing the same
intellectual work
• Virtual International Authority File - to better identify
the authors and support the identification of in copyright
works
• Books in Print database - to know if any of the books
concerned are actively commercialised by any publisher
• Reproduction Rights Organisation – to see if they know
or can trace the rightholders
The ARROW Workflow
The Role of Libraries
• National Libraries as Metadata Providers
• Provide the National Bibliographies to The European Library
The Role of The European Library (TEL)
• To match library requests with national bibliographies
• Identify all other manifestations that potentially share
intellectual work with a manifestation
• To create a Work record: work metadata, manifestations,
contributors, etc.
The Role of Books-in-Print (BIP)
• To provide data about in print/out of print status
• To provide data about publishers
• To add new manifestation records of the work
The Role of Reproduction Rights Organisation (RRO)
• RROs as Metadata Provider
• To provide data about authors and publishers
• To provide data about available licenses
…
Statements of responsibility

 These statements usually contain information
about authorship, editors, photographers,
translators, and others involved in creating the
work
 In printed books, the statement of responsibility
is typically present on the title page
• The statement of responsibility is transcribed by the cataloguer
exactly as it appears in the book
(according to Anglo-American Cataloguing Rules)
Examples of statements of
responsibility
“French Canadian freely arranged by Katherine K. Davis”.
“ed. by Peter Noever ; with a forew. by Frank O. Gehry; and contrib. by
Coop Himmelblau.”
“W. Lange, A.C. Zeven and N.G. Hogenboom, editors”
“by Pamela and Neal Priestland”
“Vicente Aleixandre ; estudio previo, selección y notas de Leopoldo de
Luis”
The problem
 National bibliographies reliably represent the
first author of a work in structured form
 But secondary contributors are often not
represented in structured form
 Secondary contributors may reside only within
the statements of responsibility
The approach
 To approach the problem as a Named Entity Recognition
task in text that may not be grammatically correct, thus
lacking lexical evidence
 Some requirements from the ARROW context
• Easily applicable to several languages
• The outcomes of the recognition task must be explainable

 Design decisions
• Exploring the structured data within national bibliographies
• By analysis of the frequency of word occurrences in names of
persons, and in other textual data
• Using word occurrence frequency makes it possible to
• bypass the need for building training sets
• provide simpler explanations of the name recognition
results
The process – pre-processing

 A pre-processing of each national
bibliography is performed:
• Word frequency is calculated
• The frequency values are normalized, to make them
independent of the size of the national bibliography
• The pre-processing results in four dictionaries:
• Words in titles
• Words in persons’ surnames
• Words in parts of persons’ names other than the surname
• Words that appear in lowercase in person names
(such as “von” in German names, or “de” in Portuguese
names)
• The dictionaries contain the normalized frequency
associated with each word
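The pre-processing step could be sketched roughly as follows. This is a minimal illustration, not the implementation used in ARROW: the record layout (`title`, `persons` in "Surname, Forenames" form), the function name `build_dictionaries`, and the choice to normalize by dividing each count by the number of records are all assumptions made for the example.

```python
from collections import Counter

def build_dictionaries(records):
    """Build four word dictionaries with normalized frequencies.

    Assumed input: dicts with a 'title' string and a 'persons' list whose
    entries look like 'Surname, Forenames' (hypothetical layout).
    """
    counts = {key: Counter() for key in
              ("title", "surname", "other_name", "lowercase_name")}
    n_records = 0
    for rec in records:
        n_records += 1
        counts["title"].update(w.lower() for w in rec["title"].split())
        for person in rec["persons"]:
            surname, _, rest = person.partition(",")
            for part, key in ((surname, "surname"), (rest, "other_name")):
                for word in part.split():
                    if word[0].islower():
                        # Particles such as "von" or "de" get their own dictionary.
                        counts["lowercase_name"][word] += 1
                    else:
                        counts[key][word.lower()] += 1
    # Assumed normalization: occurrences per record, so that values are
    # comparable across bibliographies of different sizes.
    return {key: {w: c / n_records for w, c in counter.items()}
            for key, counter in counts.items()}

sample = [
    {"title": "Der Zauberberg", "persons": ["Mann, Thomas"]},
    {"title": "Faust", "persons": ["Goethe, Johann Wolfgang von"]},
]
dictionaries = build_dictionaries(sample)
print(dictionaries["surname"])         # {'mann': 0.5, 'goethe': 0.5}
print(dictionaries["lowercase_name"])  # {'von': 0.5}
```

Because the values are plain relative frequencies, a recognition decision can be explained by pointing at how often a word occurs as a surname versus in titles.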
The process – bibliographic record
processing
 The named entity recognition is performed for a
record as follows:
• Statement of responsibility is tokenized
• The person names are recognized by comparing the
tokens with the dictionaries
• The recognized names are compared against the
names of the contributors present in the structured
fields of the record.
• If no similar name exists in the record, the contributor
is added to the record in a structured data field
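The record-level steps above might look like this in outline. It is a sketch under stated assumptions: the record fields (`statement_of_responsibility`, `contributors`), the similarity test based on `difflib.SequenceMatcher` with a threshold of 0.85, and the `toy_recognizer` stand-in are all hypothetical; the slides do not specify how name similarity is measured.

```python
import re
from difflib import SequenceMatcher

def tokenize(statement):
    """Split a statement of responsibility into word tokens (punctuation dropped)."""
    return re.findall(r"\w+", statement)

def normalize(name):
    """Lowercase and sort the words so 'Noever, Peter' equals 'Peter Noever'."""
    return " ".join(sorted(tokenize(name.lower())))

def is_similar(a, b, threshold=0.85):
    # Assumption: character-level similarity with an arbitrary threshold.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

def enrich_record(record, recognize_names):
    """Add recognized names that have no similar structured contributor."""
    tokens = tokenize(record["statement_of_responsibility"])
    for name in recognize_names(tokens):
        if not any(is_similar(name, known) for known in record["contributors"]):
            record["contributors"].append(name)
    return record

def toy_recognizer(tokens, known=frozenset({"Peter", "Noever", "Frank", "O", "Gehry"})):
    """Stand-in recognizer: groups consecutive tokens found in a fixed word set."""
    names, current = [], []
    for tok in tokens + [""]:  # the empty sentinel flushes the last name
        if tok in known:
            current.append(tok)
        elif current:
            names.append(" ".join(current))
            current = []
    return names

record = {
    "statement_of_responsibility": "ed. by Peter Noever ; with a forew. by Frank O. Gehry",
    "contributors": ["Noever, Peter"],
}
print(enrich_record(record, toy_recognizer)["contributors"])
# ['Noever, Peter', 'Frank O Gehry']
```

Passing the recognizer in as a function keeps the record handling separate from the dictionary-based recognition, which would replace `toy_recognizer` in practice.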
The process – named entity
recognition
Possible token sequences used to locate person names:
(in Augmented Backus–Naur Form)

non-ambiguous-surname
/
(
initial /
non-ambiguous-first-name /
non-ambiguous-surname /
non-ambiguous-non-capitalized-name
)
*(initial / first-name / surname / non-capitalized-name)
surname

(more details on the definition of these tokens are included in the paper)
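To make the token pattern concrete, here is a small matcher over tokens that have already been labelled. It is only a sketch: how tokens receive their labels from the dictionaries is defined in the paper and is not reproduced here, so the label sets in the examples are hand-assigned and hypothetical. The sketch also assumes that a "non-ambiguous" token carries the corresponding plain label as well, and that the longest possible match is preferred.

```python
# Labels allowed for the first token of a multi-token name.
START = {"initial", "non-ambiguous-first-name",
         "non-ambiguous-surname", "non-ambiguous-non-capitalized-name"}
# Labels allowed for tokens between the first token and the final surname.
MIDDLE = {"initial", "first-name", "surname", "non-capitalized-name"}

def match_at(labels, i):
    """Return the exclusive end index of the longest name starting at i, or None."""
    if not (labels[i] & START):
        return None
    # First alternative of the pattern: a lone non-ambiguous surname.
    best = i + 1 if "non-ambiguous-surname" in labels[i] else None
    j = i + 1
    while j < len(labels):
        if "surname" in labels[j]:
            best = j + 1          # the sequence may end on this surname
        if not (labels[j] & MIDDLE):
            break                 # token cannot continue the sequence
        j += 1
    return best

def find_names(labels):
    """Scan left to right and collect non-overlapping (start, end) name spans."""
    spans, i = [], 0
    while i < len(labels):
        end = match_at(labels, i)
        if end is None:
            i += 1
        else:
            spans.append((i, end))
            i = end
    return spans

FIRST = {"first-name", "non-ambiguous-first-name"}
LAST = {"surname", "non-ambiguous-surname"}

# "ed. by Peter Noever"
print(find_names([set(), set(), FIRST, LAST]))          # [(2, 4)]
# "by Pamela and Neal Priestland"
print(find_names([set(), FIRST, set(), FIRST, LAST]))   # [(3, 5)]
```

In the second example "Pamela" is not followed by a surname token, so only "Neal Priestland" is found. This is consistent with the shared-surname recall errors reported in the evaluation analysis.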
Evaluation data set
(size of bibliographies and evaluation samples)

                                                                                                   Evaluation sample
National Bibliography                                            Total records   Main language     Statements of responsibility   Referred persons
British Library                                                  13.4 million    English           205                            328
German National Library                                          9.4 million     German            200                            378
National Library of the Netherlands                              3.2 million     Dutch             200                            335
National Library of Greece                                       0.4 million     Greek             297                            379
Central Institute for the Union Catalogue of Italian Libraries   12.4 million    Italian           224                            297
Royal Library of Belgium                                         1 million       French and Dutch  203                            387
Total:                                                                                             1329                           2104
Evaluation results

                                                                 Exact match metric      Partial match metric
Dataset                                                          Precision   Recall      Precision   Recall
British Library                                                  0.981       0.979       0.991       0.991
German National Library                                          0.975       0.934       0.992       0.992
National Library of the Netherlands                              0.973       0.875       0.977       0.979
National Library of Greece                                       0.656       0.414       0.758       0.868
Central Institute for the Union Catalogue of Italian Libraries   0.97        0.896       0.971       0.973
Royal Library of Belgium                                         0.981       0.959       0.981       0.982
Overall:                                                         0.948       0.837       0.958       0.963
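The slides do not define the two metrics; the paper does. As a rough illustration only, the sketch below uses a common convention from named entity recognition evaluation, which is an assumption here: an exact match requires identical token spans, and a partial match requires the spans to overlap. The function names and the sample spans are hypothetical.

```python
def exact(pred, gold):
    """Spans match only if start and end are identical."""
    return pred == gold

def partial(pred, gold):
    """Spans match if they share at least one token (half-open intervals)."""
    return pred[0] < gold[1] and gold[0] < pred[1]

def precision_recall(predicted, gold, match):
    """Precision and recall of predicted name spans against annotated spans."""
    hits_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    hits_gold = sum(any(match(p, g) for p in predicted) for g in gold)
    precision = hits_pred / len(predicted) if predicted else 0.0
    recall = hits_gold / len(gold) if gold else 0.0
    return precision, recall

gold_spans = [(2, 4), (9, 12)]        # annotated person names (token offsets)
predicted_spans = [(2, 4), (10, 12)]  # second name recognized without its first token

print(precision_recall(predicted_spans, gold_spans, exact))    # (0.5, 0.5)
print(precision_recall(predicted_spans, gold_spans, partial))  # (1.0, 1.0)
```

Under these assumed definitions, a name recognized with a missing or extra token lowers the exact-match scores but still counts under the partial-match metric.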
Evaluation results analysis

 The main causes of recognition errors:
• Foreign person names negatively affected recall
• Names of persons used in names of
organizations negatively affected precision
• Two persons with the same surname mentioned
together negatively affected recall. For
example:
• “hrsg. von Volker und Michael Kriegeskorte”
• “by Pamela and Neal Priestland”
Conclusions
 The approach performed reliably in most
languages and bibliographic datasets
• Datasets of at least one million records
• Precision and recall above 0.97 on all but one dataset
(partial match metric)

 The results obtained on the Greek national
bibliography were not satisfactory
• This dataset differs from the others in several
respects:
• smaller size
• a different alphabet
• a different language
• Further investigation of the Greek national
bibliography is necessary
Future work
 Evaluation of the impact of this solution on the
final results of the rights clearance process of
ARROW
 Building the dictionaries from comprehensive
sources of names of persons
• Virtual International Authority File (VIAF)
• International Standard Name Identifier (ISNI)

 Further functionality:
• recognition of organization names
• recognition of the role of the recognized contributors
(illustrator, editor, etc.)

 Other application scenarios
• Functional Requirements for Bibliographic Records
• Resource Description and Access
Acknowledgments
 The European Library
• Marcela Strelcova, Chiara Latronico and
Eva Kralt-Yap

 Associazione Italiana Editori
 University of Innsbruck
 This work was partially supported by the
ARROWplus project, with co-funding by the
European Commission programme
eContentplus
Thank you
Questions or comments?
Contact:
Nuno Freire – nuno.freire@kb.nl

 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 

Word Occurrence Based Extraction of Work Contributors from Statements of Responsibility

  • 1. Word Occurrence Based Extraction of Work Contributors from Statements of Responsibility Nuno Freire The European Library TPDL-2013 Valletta, September 2013
  • 2. Overview
      • Statements of responsibility from library bibliographic data:
          • “French Canadian freely arranged by Katherine K. Davis”
          • “ed. by Peter Noever ; with a forew. by Frank O. Gehry; and contrib. by Coop Himmelblau.”
          • “W. Lange, A.C. Zeven and N.G. Hogenboom, editors”
          • “Vicente Aleixandre ; estudio previo, selección y notas de Leopoldo de Luis”
      • Extracting work contributors for use in a rights infrastructure: ARROW (http://arrow-net.eu)
  • 3. Outline
      • The context
          • The ARROW rights infrastructure
          • The use of national bibliographies in ARROW
      • The problem
      • The approach
      • Evaluation
      • Conclusion and future work
  • 4. The ARROW rights infrastructure
      • ARROW aims to support mass digitisation projects with automated ways to clear the rights of the books to be digitised.
      • To identify and clear the rights associated with a book, a complex process needs to be undertaken:
          • Determine the work(s) contained within the book
          • Identify all the other expressions of the same work(s)
          • Identify the publisher(s) and contributor(s) involved
          • Determine the dates of publication at work level
          • Determine whether that work(s), and not the book itself, is still in commerce
          • If necessary, obtain any licenses from the rights holders or collective rights organizations
  • 5. What is ARROW
      • A rights infrastructure and system for the identification of:
          • Rights status
              • In or out of copyright
              • In or out of print / commercialised or not
          • Rights
              • Which rights are involved
          • Right holders
              • Authors
              • Publishers
          • How and where to clear the rights
          • Orphan Works and their registration
  • 6. Sources of Information in ARROW
      • ARROW makes information available from several sources:
          • The European Library:
              • National bibliographies - to identify the book and to cluster it with all other books containing the same intellectual work
          • Virtual International Authority File - to better identify the authors and support the identification of in-copyright works
          • Books in Print database - to know if any of the books concerned are actively commercialised by any publisher
          • Reproduction Rights Organisation - to see if they know or can trace the rightholders
  • 8. The Role of Libraries
      • National Libraries as Metadata Providers
          • Provide the National Bibliographies to The European Library
  • 9. The Role of The European Library (TEL)
      • To match library requests with national bibliographies
      • To identify all other manifestations that potentially share intellectual work with a manifestation
      • To create a Work record: work metadata, manifestations, contributors, etc.
  • 10. The Role of Books-in-Print (BIP)
      • To provide data about in print/out of print status
      • To provide data about publishers
      • To add new manifestation records of the work
  • 11. The Role of Reproduction Rights Organisation (RRO)
      • RROs as Metadata Providers
          • To provide data about authors and publishers
          • To provide data about available licenses
          • …
  • 12. Statements of responsibility
      • These statements usually contain information about authorship, editors, photographers, translators, and others involved in creating the work
      • In printed books, the statement of responsibility is typically present on the title page
          • The statement of responsibility is transcribed by the cataloguer exactly as it appears in the book (according to the Anglo-American Cataloguing Rules)
  • 13. Examples of statements of responsibility
      • “French Canadian freely arranged by Katherine K. Davis”
      • “ed. by Peter Noever ; with a forew. by Frank O. Gehry; and contrib. by Coop Himmelblau.”
      • “W. Lange, A.C. Zeven and N.G. Hogenboom, editors”
      • “by Pamela and Neal Priestland”
      • “Vicente Aleixandre ; estudio previo, selección y notas de Leopoldo de Luis”
  • 14. The problem
      • National bibliographies reliably represent the first author of a work in structured form
      • But secondary contributors are often not represented in structured form
      • Secondary contributors may appear only within the statements of responsibility
  • 15. The approach
      • Treat the problem as a Named Entity Recognition task on text that may not be grammatically correct and therefore lacks lexical evidence
      • Some requirements from the ARROW context:
          • Easily applicable to several languages
          • The outcomes of the recognition task must be explainable
      • Design decisions:
          • Exploit the structured data within national bibliographies
              • By analysing the frequency of word occurrences in names of persons and in other textual data
          • Using word occurrence frequency makes it possible to:
              • bypass the need for building training sets
              • provide simpler explanations of the name recognition results
  • 16. The process – pre-processing
      • Each national bibliography is pre-processed:
          • Word frequency is calculated
          • The frequency values are normalized, so that they are independent of the size of the national bibliography
          • The pre-processing results in four dictionaries:
              • Words in titles
              • Words in persons' surnames
              • Words in parts of persons' names other than the surname
              • Words that appear in lowercase in person names (such as “von” in German names, or “de” in Portuguese names)
          • The dictionaries contain the normalized frequency associated with each word
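
The pre-processing step can be sketched in a few lines of Python. This is only an illustration, not the implementation used in ARROW: the record layout (`title`, `persons`, `surname`, `forenames`), the function name `build_dictionaries`, and the normalization (occurrences divided by the number of records) are all assumptions, since the slides state only that frequencies are normalized for independence of bibliography size.

```python
import re
from collections import Counter

# Runs of Unicode letters; digits and punctuation separate words.
WORD_RE = re.compile(r"[^\W\d_]+")


def build_dictionaries(records):
    """Build four normalized word-frequency dictionaries.

    `records` is a hypothetical structure: a list of dicts, each with a
    "title" string and a "persons" list of
    {"surname": ..., "forenames": ...} dicts.
    """
    counts = {name: Counter()
              for name in ("titles", "surnames", "other_names", "lowercase")}
    for record in records:
        for word in WORD_RE.findall(record["title"]):
            counts["titles"][word.lower()] += 1
        for person in record["persons"]:
            for field, target in (("surname", "surnames"),
                                  ("forenames", "other_names")):
                for word in WORD_RE.findall(person[field]):
                    if word.islower():
                        # Particles such as "von" or "de" go to their own dictionary.
                        counts["lowercase"][word] += 1
                    else:
                        counts[target][word.lower()] += 1
    # Assumed normalization: occurrences per record in the bibliography.
    total = max(len(records), 1)
    return {name: {word: count / total for word, count in counter.items()}
            for name, counter in counts.items()}


# Two made-up records, for illustration only.
records = [
    {"title": "Faust",
     "persons": [{"surname": "von Goethe", "forenames": "Johann Wolfgang"}]},
    {"title": "Faust eine Tragödie",
     "persons": [{"surname": "Goethe", "forenames": "Johann"}]},
]
dictionaries = build_dictionaries(records)
```

Dividing by the number of records is one simple way to make values comparable between a 0.4 million record bibliography and a 13.4 million record one; other normalizations would serve the same purpose.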
  • 17. The process – bibliographic record processing
      • The named entity recognition is performed for a record as follows:
          • The statement of responsibility is tokenized
          • The person names are recognized by comparing the tokens with the dictionaries
          • The recognized names are compared against the names of the contributors present in the structured fields of the record
          • If no similar name exists in the record, the contributor is added to the record in a structured data field
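
The last two steps (comparing recognized names with the structured contributors and adding the missing ones) might look like the sketch below. The slides do not say how name similarity is measured, so the use of `difflib.SequenceMatcher`, the token sorting, the 0.85 threshold, and the `contributors` field are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop commas, and sort the tokens so that
    "Noever, Peter" and "Peter Noever" compare as equal."""
    tokens = name.lower().replace(",", " ").split()
    return " ".join(sorted(tokens))


def is_similar(a, b, threshold=0.85):
    # Assumed similarity measure: character-level ratio on normalized names.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def add_missing_contributors(record, recognized_names):
    """Append recognized names that match no structured contributor.

    `record` is a hypothetical dict with a "contributors" list of strings.
    Returns the list of names that were added.
    """
    added = []
    for name in recognized_names:
        if not any(is_similar(name, known) for known in record["contributors"]):
            record["contributors"].append(name)
            added.append(name)
    return added


record = {"contributors": ["Noever, Peter"]}
added = add_missing_contributors(record, ["Peter Noever", "Frank O. Gehry"])
```

Here "Peter Noever" is recognized as already present, so only "Frank O. Gehry" is added to the record.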
  • 18. The process – named entity recognition
      Possible token sequences used to locate person names (in Augmented Backus–Naur Form):
          non-ambiguous-surname
          / ( initial / non-ambiguous-first-name / non-ambiguous-surname / non-ambiguous-non-capitalized-name )
            *( initial / first-name / surname / non-capitalized-name )
            surname
      (More details on the definition of these tokens are included in the paper.)
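
A rough sketch of how such token sequences could be matched is shown below. The grammar is read as: either a single non-ambiguous surname, or a non-ambiguous starting token, followed by any number of name-like tokens, ending in a surname. The token definitions are given in the paper and are not reproduced in the slides, so the `classify` rule used here (a word is "non-ambiguous" when its name frequency exceeds `ratio` times its title frequency) is a stand-in assumption, as are all function names and the example frequencies.

```python
import re

# A run of letters with an optional trailing period ("O.", "ed.").
TOKEN_RE = re.compile(r"[^\W\d_]+\.?")

START = {"initial", "non-ambiguous-first-name", "non-ambiguous-surname",
         "non-ambiguous-non-capitalized-name"}
MIDDLE = {"initial", "first-name", "surname", "non-capitalized-name"}


def classify(token, dicts, ratio=2.0):
    """Return the grammar token classes that `token` may belong to."""
    word = token.rstrip(".").lower()
    if token.endswith(".") and len(word) == 1 and token[0].isupper():
        return {"initial"}
    classes = set()
    title_freq = dicts["titles"].get(word, 0.0)
    if token[0].isupper():
        candidates = (("surname", "surnames"), ("first-name", "other_names"))
    else:
        candidates = (("non-capitalized-name", "lowercase"),)
    for cls, key in candidates:
        freq = dicts[key].get(word, 0.0)
        if freq > 0:
            classes.add(cls)
            if freq > ratio * title_freq:  # assumed ambiguity test
                classes.add("non-ambiguous-" + cls)
    return classes


def find_names(statement, dicts):
    tokens = TOKEN_RE.findall(statement)
    classes = [classify(t, dicts) for t in tokens]
    names, i = [], 0
    while i < len(tokens):
        end = None
        if classes[i] & START:
            j = i + 1
            while j < len(tokens) and classes[j] & MIDDLE:
                j += 1
            # The sequence must end in a surname: back off to the last one.
            for k in range(j - 1, i, -1):
                if "surname" in classes[k]:
                    end = k
                    break
        if end is None and "non-ambiguous-surname" in classes[i]:
            end = i  # first alternative: a lone non-ambiguous surname
        if end is None:
            i += 1
        else:
            names.append(" ".join(tokens[i:end + 1]))
            i = end + 1
    return names


# Made-up frequencies, for illustration only.
example_dicts = {
    "titles": {"by": 0.3, "with": 0.2, "frank": 0.001},
    "surnames": {"noever": 0.0001, "gehry": 0.0002, "frank": 0.0005},
    "other_names": {"peter": 0.01, "frank": 0.008},
    "lowercase": {"von": 0.01, "de": 0.02},
}
names = find_names("ed. by Peter Noever ; with a forew. by Frank O. Gehry",
                   example_dicts)
```

Because every decision reduces to a dictionary lookup and a frequency comparison, each recognized name can be explained by listing the classes assigned to its tokens, which fits the explainability requirement stated for ARROW.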
  • 19. Evaluation data set (size of bibliographies and evaluation samples)
      National bibliography                                          | Total records | Main language    | Sample: statements of responsibility | Sample: referred persons
      British Library                                                | 13.4 million  | English          | 205                                  | 328
      German National Library                                        | 9.4 million   | German           | 200                                  | 378
      National Library of the Netherlands                            | 3.2 million   | Dutch            | 200                                  | 335
      National Library of Greece                                     | 0.4 million   | Greek            | 297                                  | 379
      Central Institute for the Union Catalogue of Italian Libraries | 12.4 million  | Italian          | 224                                  | 297
      Royal Library of Belgium                                       | 1 million     | French and Dutch | 203                                  | 387
      Total                                                          |               |                  | 1329                                 | 2104
  • 20. Evaluation results
      Dataset                                                        | Exact match: precision | Exact match: recall | Partial match: precision | Partial match: recall
      British Library                                                | 0.981                  | 0.975               | 0.979                    | 0.934
      German National Library                                        | 0.991                  | 0.992               | 0.991                    | 0.992
      National Library of the Netherlands                            | 0.973                  | 0.875               | 0.977                    | 0.979
      National Library of Greece                                     | 0.656                  | 0.414               | 0.758                    | 0.868
      Central Institute for the Union Catalogue of Italian Libraries | 0.97                   | 0.896               | 0.971                    | 0.973
      Royal Library of Belgium                                       | 0.981                  | 0.948               | 0.959                    | 0.837
      Overall                                                        | 0.981                  | 0.958               | 0.982                    | 0.963
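
The slides do not define the two metrics. One common reading, assumed here, is that an exact match requires the recognized name to cover precisely the same tokens as the annotated name, while a partial match only requires the two to overlap. Under that assumption, precision and recall could be computed as in this sketch (the span representation and function name are hypothetical):

```python
def precision_recall(predicted, gold, partial=False):
    """Precision and recall over name spans.

    `predicted` and `gold` are lists of (start, end) token offsets with an
    exclusive end. With partial=True, overlapping spans count as a match.
    """
    def matches(a, b):
        if partial:
            return a[0] < b[1] and b[0] < a[1]  # the spans overlap
        return a == b

    correct_predictions = sum(any(matches(p, g) for g in gold) for p in predicted)
    found_gold = sum(any(matches(g, p) for p in predicted) for g in gold)
    precision = correct_predictions / len(predicted) if predicted else 0.0
    recall = found_gold / len(gold) if gold else 0.0
    return precision, recall


# Annotated names cover tokens 2-3 and 8-10; the recognizer missed the
# first token of the second name.
gold = [(2, 4), (8, 11)]
predicted = [(2, 4), (9, 11)]
exact = precision_recall(predicted, gold)
partial = precision_recall(predicted, gold, partial=True)
```

In this made-up case the truncated second name counts as an error under the exact metric and as a hit under the partial metric.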
  • 21. Evaluation results analysis
      • The main causes of recognition errors:
          • Foreign person names negatively affected recall
          • Names of persons used in names of organizations negatively affected precision
          • Two persons with the same surname mentioned together negatively affected recall. For example:
              • “hrsg. von Volker und Michael Kriegeskorte”
              • “by Pamela and Neal Priestland”
  • 22. Conclusions
      • The approach performed reliably in most languages and bibliographic datasets
          • Datasets of at least one million records
          • Precision and recall above 0.97 on all but one dataset
      • The results obtained on the Greek national bibliography were not satisfactory
          • This dataset differs from the others in several respects:
              • smaller size
              • a different alphabet
              • a different language
          • Further investigation of the Greek national bibliography is necessary
  • 23. Future work
      • Evaluation of the impact of this solution on the final results of the rights clearance process of ARROW
      • Building the dictionaries from comprehensive sources of names of persons
          • Virtual International Authority File (VIAF)
          • International Standard Name Identifier (ISNI)
      • Further functionality:
          • recognition of organization names
          • recognition of the role of the recognized contributors (illustrator, editor, etc.)
      • Other application scenarios
          • Functional Requirements for Bibliographic Records
          • Resource Description and Access
  • 24. Acknowledgments
      • The European Library
          • Marcela Strelcova, Chiara Latronico and Eva Kralt-Yap
      • Associazione Italiana Editori
      • University of Innsbruck
      • This work was partially supported by the ARROWplus project, with co-funding by the European Commission programme eContentplus
  • 25. Thank you
      Questions or comments? Contact: Nuno Freire, nuno.freire@kb.nl