Giovanni Colavizza
Matteo Romanello (@mr56k)
Frédéric Kaplan (@frederickaplan)
The References of References:
Enriching Library Catalogs via Domain-Specific
Reference Mining
Goal
Empowering scholars in
the Humanities with better
IR systems
Motivation - the Scholar
Issue: a lack of data [Sula and Miller, 2014] leads to an absence of
services. The estimated coverage of the Humanities in Web of Science is
circa 13% [Mingers and Leydesdorff, 2015].
Sciences:
• Google Scholar
• English
• mainly papers
• lower-cost information gathering
Humanities:
• no Google Scholar-like system
• multiple languages
• mainly monographs
• higher-cost information gathering
Motivation - the Footnote
How do humanists cite? In footnotes [see e.g. Hellqvist, 2009]
Motivation - the Archive
Approximately half of all citations are to primary sources [Wiberley Jr., 2009]
Motivation - the Scholar reloaded
Proposal: Enriching library catalogs
Use reference monographs, the “canon” of the
domain, to extract references to the rest of the
literature and enrich library catalogs.
Project: Linked Books
Focused on a case study/domain:
the history of Venice.
Partners so far:
• Ca’ Foscari University Library System
• Biblioteca Marciana
• Istituto Veneto di Scienze, Lettere ed Arti
• Archivio di Stato di Venezia
• EPFL
The Pipeline
Corpus selection
Result: 1,904 monographs, 701 of them with a structured list of references.
Use the means of the library:
1- Consultation shelves
2- Dewey and subject classification
3- Scholarly bibliographies
4- Keyword search
The Pipeline - Digitization
Digitization
1,904 monographs + ~1,000 journal issues
The Pipeline - Annotation/Extraction/Parsing
Annotation
• annotated 27% of 701 monographs (with reference list)
• 3.8% of all digitized pages (with references)
• annotators identified 33 citation styles, divided into 6 families
• Yes, humanities scholars love customized reference styles!
Reference Extraction/Parsing
[Klinkhammer b-i-secondary-full] [Lutz, i-secondary-full] [L’occupazione i-secondary-full]
[tedesca i-secondary-full] [in i-secondary-full] [Italia i-secondary-full]
[1943-1945, i-secondary-full] [Torino, i-secondary-full] [Bollati i-secondary-full]
[Boringhieri i-secondary-full] [1993 i-secondary-full].
Klinkhammer Lutz, L’occupazione tedesca in Italia 1943-1945,
Torino, Bollati Boringhieri 1993.
[Klinkhammer author] [Lutz, author] [L’occupazione title] [tedesca title]
[in title] [Italia title] [1943-1945, title] [Torino, publicationplace] [Bollati
publisher] [Boringhieri publisher] [1993 publicationyear].
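The parsed token sequence above can be turned into a bibliographic record by merging runs of consecutive tokens that share a label. The following is a minimal sketch of that grouping step only, not the project's parser: the function name `group_fields` and the punctuation stripping are assumptions made for illustration, while the tokens and labels are copied from the example.

```python
from itertools import groupby

# Token/label pairs copied from the parsing example above.
TAGGED = [
    ("Klinkhammer", "author"), ("Lutz,", "author"),
    ("L’occupazione", "title"), ("tedesca", "title"), ("in", "title"),
    ("Italia", "title"), ("1943-1945,", "title"),
    ("Torino,", "publicationplace"),
    ("Bollati", "publisher"), ("Boringhieri", "publisher"),
    ("1993", "publicationyear"),
]


def group_fields(tagged):
    """Merge runs of consecutive tokens sharing a label into one field.

    Hypothetical helper: trailing commas and periods are stripped from
    each field, which is an assumption, not a documented project rule.
    """
    fields = []
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        text = " ".join(token for token, _ in run)
        fields.append((label, text.rstrip(",.")))
    return fields


record = dict(group_fields(TAGGED))
for label, text in record.items():
    print(f"{label}: {text}")
```

Building a `dict` keeps only the last run for a label that occurs more than once, so a reference with two separate author runs would need a list-valued record instead.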
Extraction/Parsing - Evaluation
Extraction/Parsing - Confusion Matrix
[Figure: confusion matrices. Class labels visible include: null, author, title, abbrev. (E), monograph (E)]
Task 1, F1 score: 0.806 (avg); 0.609 for class “null”
Task 2, F1 score: 0.842 (avg); 0.242 for class “end abbreviated”
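Per-class F1 scores like those reported above can be read off a confusion matrix. The sketch below shows one way to do it. The counts are hypothetical toy values, not the project's data, and the slides do not say whether "(avg)" is a macro or a weighted average, so the unweighted macro average used here is an assumption.

```python
def f1_per_class(confusion, labels):
    """Compute F1 for each class.

    confusion[i][j] is the number of tokens whose true class is
    labels[i] and whose predicted class is labels[j].
    """
    scores = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                  # true k, predicted otherwise
        fp = sum(row[k] for row in confusion) - tp   # predicted k, true otherwise
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical toy counts, NOT taken from the evaluation in the slides.
LABELS = ["null", "author", "title"]
CONFUSION = [
    [6, 2, 2],  # true null
    [1, 8, 1],  # true author
    [1, 0, 9],  # true title
]

scores = f1_per_class(CONFUSION, LABELS)
macro_f1 = sum(scores.values()) / len(scores)  # assumption: unweighted mean
for label, value in scores.items():
    print(f"{label}: {value:.3f}")
print(f"macro average: {macro_f1:.3f}")
```

A class with a low score next to a high average, as with “null” and “end abbreviated” above, is what this per-class breakdown is meant to expose.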
The Pipeline - Lookup
Lookup
1. Against OPAC SBN (via API)
Goal: disambiguation of references
Steps:
1. search candidates by title
2. match reference metadata
3. assign each candidate a confidence score
4. return set of candidates
Evaluation:
• 2k references (out of 181k)
• 41.7% no candidates
• 58.3% with candidates:
  • 72.3% -> first candidate correct
Issues:
• OCR errors -> impact on search by title (low recall)
• API as a “black box” + bottleneck of search by title
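The four lookup steps can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the real lookup queries OPAC SBN through its API, which is replaced here by a hypothetical in-memory list, and the similarity measure (`difflib.SequenceMatcher`), the 0.6 title cutoff, and the equal weighting of title, author and year are all assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical in-memory stand-in for catalog records. Only the first
# record reuses the reference shown earlier; the others are placeholders.
CATALOG = [
    {"id": "rec-1", "author": "Klinkhammer Lutz",
     "title": "L’occupazione tedesca in Italia 1943-1945", "year": "1993"},
    {"id": "rec-2", "author": "Autore Esempio",
     "title": "L’occupazione tedesca in Europa", "year": "1980"},
    {"id": "rec-3", "author": "Autore Esempio",
     "title": "Titolo di esempio", "year": "1993"},
]


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search_by_title(title, catalog, min_ratio=0.6):
    """Step 1: keep records whose title is close enough to the query."""
    return [rec for rec in catalog if similarity(title, rec["title"]) >= min_ratio]


def confidence(reference, record):
    """Steps 2-3: compare metadata fields and average them into one score."""
    parts = [
        similarity(reference["title"], record["title"]),
        similarity(reference["author"], record["author"]),
        1.0 if reference["year"] == record["year"] else 0.0,
    ]
    return sum(parts) / len(parts)


def lookup(reference, catalog):
    """Step 4: return scored candidates, best first (may be empty)."""
    candidates = search_by_title(reference["title"], catalog)
    scored = [(confidence(reference, rec), rec["id"]) for rec in candidates]
    return sorted(scored, reverse=True)


REFERENCE = {"author": "Klinkhammer Lutz",
             "title": "L’occupazione tedesca in Italia 1943-1945",
             "year": "1993"}
ranked = lookup(REFERENCE, CATALOG)
for score, rec_id in ranked:
    print(f"{rec_id}: {score:.2f}")
```

Because every candidate must first pass the title search, an OCR-damaged title can leave the candidate list empty no matter how well the other fields would have matched, which is the bottleneck named in the issues above.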
Lookup
2. Against metadata of digitized books
Goal: verify cohesiveness of digitized corpus
Method:
• based on SBN lookup
• but lookup against digitization metadata
• tuned to maximize precision
• returns 1 or no matches
Evaluation*:
• 500 references (out of 181k)
• precision ~ 1.00
• recall > 0.95
Result:
• only 7% of references extracted from the 701 monographs point inwards (i.e. towards the 1,904 monographs)
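A precision-oriented lookup that returns one match or none, together with the precision and recall used to evaluate it, can be sketched as below. The 0.9 threshold, the title-only comparison, and all sample data are assumptions for illustration. The slides state only that the method is tuned to maximize precision.

```python
from difflib import SequenceMatcher


def best_match(reference_title, corpus_titles, threshold=0.9):
    """Return the id of the single best match, or None below the threshold."""
    best_id, best_ratio = None, 0.0
    for book_id, title in corpus_titles.items():
        ratio = SequenceMatcher(None, reference_title.lower(), title.lower()).ratio()
        if ratio > best_ratio:
            best_id, best_ratio = book_id, ratio
    return best_id if best_ratio >= threshold else None


def precision_recall(predicted, gold):
    """Both arguments map a reference id to a book id, or to None."""
    tp = sum(1 for ref, book in predicted.items()
             if book is not None and gold[ref] == book)
    returned = sum(1 for book in predicted.values() if book is not None)
    relevant = sum(1 for book in gold.values() if book is not None)
    precision = tp / returned if returned else 1.0
    recall = tp / relevant if relevant else 1.0
    return precision, recall


# Hypothetical toy corpus and references, not project data.
CORPUS = {
    "book-1": "L’occupazione tedesca in Italia 1943-1945",
    "book-2": "Titolo di esempio",
}
REFERENCES = {
    "ref-a": "L’occupazione tedesca in Italia 1943-1945",  # exact title
    "ref-b": "L’occupazione tedesca in ltalia 1943-1945",  # one OCR error
    "ref-c": "Un titolo fuori dal corpus",                 # not in the corpus
    "ref-d": "L’occupazione tedesca",                      # abbreviated title
}
GOLD = {"ref-a": "book-1", "ref-b": "book-1", "ref-c": None, "ref-d": "book-1"}

predicted = {ref: best_match(title, CORPUS) for ref, title in REFERENCES.items()}
precision, recall = precision_recall(predicted, GOLD)
print(predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy run the strict threshold tolerates a single OCR error but rejects the abbreviated title, so precision stays at 1.00 while recall drops: the trade-off a precision-tuned matcher accepts.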
Core of the discipline
• co-citation network from extracted references*
• giant component = 59% of selected corpus
• books in the giant component -> core of reference works on the history of Venice
• giant component -> 32.5% with only works in consultation
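A co-citation network links two works whenever the same monograph cites both, and the giant component is its largest connected component. The sketch below builds such a network from hypothetical toy data and measures the giant component's share of the nodes. Which set the slide's percentages are taken over is not spelled out, so the denominator used here (all cited works) is an assumption.

```python
from itertools import combinations

# Hypothetical toy data: citing monograph -> works it cites.
CITATIONS = {
    "m1": ["a", "b", "c"],
    "m2": ["c", "d"],
    "m3": ["e", "f"],
    "m4": ["g"],
}


def cocitation_graph(citations):
    """Link every pair of works cited together by the same monograph."""
    graph = {}
    for cited in citations.values():
        works = set(cited)
        for work in works:
            graph.setdefault(work, set())  # keep works with no co-citations
        for u, v in combinations(works, 2):
            graph[u].add(v)
            graph[v].add(u)
    return graph


def giant_component(graph):
    """Return the largest connected component, found by depth-first search."""
    seen, largest = set(), set()
    for start in graph:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for neighbour in graph[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    stack.append(neighbour)
        seen |= component
        if len(component) > len(largest):
            largest = component
    return largest


graph = cocitation_graph(CITATIONS)
giant = giant_component(graph)
share = len(giant) / len(graph)
print(sorted(giant), f"{share:.1%}")
```

Here works a, b, c and d form the giant component because c is cited by both m1 and m2, which bridges the two reference lists.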
Conclusions and Outlook
• a data- and citation-driven approach to assess and exploit, from an IR point of view, domain-specific library holdings on the history of Venice
• next big challenge: extraction, consolidation and disambiguation of references contained within footnotes (journals)
Giovanni Colavizza
Matteo Romanello (@mr56k)
Frédéric Kaplan (@frederickaplan)
Thank you!
go.epfl.ch/linkedbooks