A hands-on data exploration & challenge to become a derived data-set author on the British Library’s open data-set platform (https://data.bl.uk)

1
@BL_Labs #DHA2018 @BL_DigiSchol labs@bl.uk
http://www.bl.uk/projects/british-library-labs
Funded by the Andrew W. Mellon Foundation
Running since March 2013
A hands-on data exploration & challenge to become a derived data-set author
on the British Library’s open data-set platform (https://data.bl.uk)
Mahendra Mahey, Manager of BL Labs, British Library, London, UK.
1400 – 1530, Tuesday 25 September 2018
Workshop part of ‘Making Connections’, Digital Humanities Australasia, 2018
(#DHA2018), University of South Australia, City West campus, Adelaide, SA, Australia

2
Who do we work with?
Researchers
https://goo.gl/WutNyi
Artists
http://goo.gl/nNKhQ2
Librarians
Curators
https://goo.gl/9NWZUW
Software Developers
https://goo.gl/7QQ5Tf
Archivists
https://goo.gl/x7b4tg
Educators
https://goo.gl/qh01Mi
Working and Communicating
Entrepreneurs
https://goo.gl/Fx8RG7

3
Competition
Awards
Projects
Tell us your ideas of what to do with our digital content (2013-16)
Show us what you have already done with our digital content in research,
artistic, commercial, learning and teaching, staff categories
Talk to us about working on collaborative projects
Tell us your ideas of what to do with our digital content
Engagement
• Roadshows
• Events
• Meetings
• Conversations
New! Digital Research Support
How?

4
Collections – not just books!
> 180* million items
> 0.8* m serial titles
> 8* m stamps
> 14* m books
> 6* m sound recordings
> 4* m maps
> 1.6* m musical scores
> 0.3* m manuscripts
> 60* m patents
King’s Library
*Estimates

5
Have you got X?
https://upload.wikimedia.org/wikipedia/commons/5/50/Real_wuerzburg.jpg
Looking for Physical Content in the British Library

6
#bldigital
3 %* digitised
* estimate
Digital
Partnerships
Commercial & Other
Organisations
Bias in digitisation
http://goo.gl/bR9UJL
Sample Generator
15 %* Openly Licensed – most online
85 %* Available onsite only at the moment
Digitisation / Curating Born Digital
costs money, time, resources
http://www.turing.ac.uk
Digital increasing
rapidly
Born Digital
http://www.webarchive.org.uk/ukwa/

7
Have you got X digitised / in digital form?
http://www.yorkmix.com/wp-content/uploads/2014/04/mr-simms-sweet-shoppe-york.jpg
Looking for Digitised / Digital Content in the BL

8
Our Audience and Collections
Audience
research &
Digital
interests
Digital
collections we
have
This is where Labs works
It starts with making connections!
The theme of DHA2018

9
Finding Open Cultural Heritage Datasets
Collection Guides (219 as of 25/09/2018)
https://www.bl.uk/collection-guides/
Datasets about our collections
Bibliographic datasets relating to our published and archival holdings
Datasets for content mining
Content suitable for use in text and data mining research
Datasets for image analysis
Image collections suitable for large-scale image-analysis-based research
Datasets from UK Web Archive
Data and API services available for accessing UK Web Archive
Digital mapping
Geospatial data, cartographic applications, digital aerial photography and
scanned historic map materials
https://data.bl.uk
Download collections as zips, no API
Each dataset has a Digital Object Identifier (DOI)
can be referenced for research
Not all discoverable via
search engines!

10
Explore Our Data at http://data.bl.uk!
• CSV of Metadata
https://data.bl.uk/digbks/dig19cbooks-mdata-csv.csv
• 19th Century Books - Book Metadata - 01/09/2013.
https://data.bl.uk/digbks/db21.html
• Digitised Books - Flickr Tag History - Dec 2013 to March 2016. TSV
https://data.bl.uk/digbks/db15.html
• Digitised Hebrew Manuscripts - Metadata
https://data.bl.uk/hebrewmanuscripts/heb1.html
• Digitised Hebrew Manuscripts: Or 2210 - Or 2364
https://data.bl.uk/hebrewmanuscripts/heb8.html
• Theatrical playbills from Britain and Ireland (OCR text only)
https://data.bl.uk/playbills/pb2.html
• Portraits of actors, views of theatres and playbills (covering 1750 - 1821 in a single volume)
https://data.bl.uk/singlesheet/por1.html
• Volumes of Lysons Collectanea (Amusements), comprising broadsides, cuttings, advertisements on
amusements. 1660-1840.
https://data.bl.uk/singlesheet/ad1.html
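Once a metadata CSV such as the one listed on this slide has been downloaded, a first look can be taken with a few lines of Python. This is a minimal sketch only: the column names (`title`, `date`) and the sample rows are invented for illustration, so the real header of the downloaded file should be checked first.

```python
import csv
import io
from collections import Counter

# Invented sample standing in for a downloaded metadata CSV.
# The column names are assumptions, not the real header.
SAMPLE_CSV = """title,date
A Tour in Wales,1801
Poems on Several Occasions,1801
A History of London,1805
"""

def count_by_column(csv_text, column):
    """Count how many rows share each value in the given column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)

counts = count_by_column(SAMPLE_CSV, "date")
for value, n in sorted(counts.items()):
    print(value, n)
```

For a real file, the text passed in would come from opening the downloaded CSV rather than from the inline sample.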

11
The Story of the Digital Collection…
Digital
Collection
Curator
Who paid for the digitisation?
Who did the digitisation?
Technology used
Born digital?
Published
Unpublished
Where is it?
Access / API?
Can it still be accessed?
Generates income
Reputational risk in using?
Legalities /
Ethics / Morality
Politics when digitised
Personalities involved
Surprises (e.g. gaps)
Descriptive information
Old format not supported
What media was the
digitisation done from?
Is there any background documentation?
No Descriptive information
Inconsistent descriptive information
Still there?
Good to know the background ‘story’ of a Digital Collection
if you want to use it for projects …

12
https://goo.gl/qpCLlk
https://goo.gl/wMTS3Z
• Dialogue typically:
– you are ‘lucky’ & we have the digital content
/ data relevant to your research
– we don’t have exactly what you’re looking for,
but is there anything of interest? Let’s talk…
– engagement is hard work and it’s constantly required to
maintain interest in our digital collections!
• Artists find this dialogue easier…
• We also tend to attract researchers with ‘fuzzier’ research
boundaries and possibly open to more
interdisciplinary / collaborative research
What engagement does the BL have with researchers
wanting to use our digital content?

13
Open Content vs Onsite Only Access
• Access easier for openly licensed content
• More challenging for on-site, in-copyright, non-print legal
deposit, data protected, old content media & contemporary material (post 1877)
https://goo.gl/Y5zCXg
©

14
How do we give access to
onsite-only
Digital Collections
(85% of our Digital Collections)?

15
READING
ROOM
ON
SITE
NOT
ONLINE
OPEN
British Library
£
Labs Residency Model
Challenges of access to Digital Collections at the BL

16
Accessing digital collections onsite
OPEN
£
• Have to be ‘onsite’ (interpretations vary)
• Need to be ‘security cleared’ ‘trusted’ for some collections
– Hence ‘Researcher in Residence Model’
• Permission required (depending on ‘story’ of collection)
• Content could be on various media formats
(not always online)
• 5 - 20 % re-use of material for non commercial research for some collections,
depends on agreements in place
• We are learning ‘pathways’ so that providing onsite access to some digital
collections becomes ‘everyday’ in the future

17
Phases of interaction at BL Labs
Submit idea for
support
Ideas always change
Once people experience the data
and culture of the organisation

18
eResearch SA Open Data Directory
http://www.data.sa.edu.au/

19
URLs to download sample files not on data.bl.uk
• https://www.data.sa.edu.au/dataset/newspapers-from-british-library/
• https://www.data.sa.edu.au/dataset/
• https://www.data.sa.edu.au/dataset/

20
Working with British Library Digitised Newspapers
• Digitised through public / private means
• Commercial products offer search interfaces for looking for content manually,
but no APIs; still a useful starting point, as manual methods can
translate into computational ones
• OCR quality is not great and metadata is OK, but there is plenty of hidden material;
approaches need to take this into account, e.g. ‘Good, Bad and Ugly’ OCR
• You can purchase drives from GALE Cengage with content (dependent on
subscription)

21
Good, Bad, Ugly Image Quality / OCR
• Original image capture of newspaper images can affect the quality of the OCR
• A poor image is very difficult to re-OCR
• Good image quality gives a much better chance for re-OCR
• Bi-tonal, Grey Scale or Colour capture can affect the quality of the OCR
• Methodology of working with collection at scale needs to acknowledge OCR and
image quality
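One way a methodology can acknowledge OCR quality is to score each page before analysing it. The sketch below uses a crude stand-in measure, the share of tokens found in a word list. This is an illustrative heuristic, not the OCR quality figure recorded in the collection metadata, and the tiny word list and the function name are hypothetical.

```python
import re

# Tiny illustrative word list; a real run would load a full dictionary.
KNOWN_WORDS = {"the", "london", "gazette", "published", "by", "authority", "of", "and"}

def dictionary_hit_rate(text, known_words=KNOWN_WORDS):
    """Return the fraction of alphabetic tokens that appear in known_words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in known_words)
    return hits / len(tokens)

# Invented strings: a clean line and a heavily garbled version of it.
good = dictionary_hit_rate("The London Gazette published by authority")
ugly = dictionary_hit_rate("Tbe Loudon Gazctte pnblifhed hy autbority")
print(good, ugly)
```

Pages could then be grouped into ‘good’, ‘bad’ and ‘ugly’ bands by choosing thresholds on this score; where to set them is a judgment call for each collection.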

22
Breaking Black Boxes – Melodee Beals
http://doi.org/cm3m

23
Burney Collection
• Gathered by the Reverend Charles Burney (1757-1817)
• 700 volumes, newspapers and news pamphlets, published in London, English
provincial, Irish and Scottish papers, and a few examples from the American
colonies.
• 1271 titles
• Around 1 million page images, digitised from microfilm around 2006
• OCR quality mixed, used custom XML format
• Bi-tonal

24
Web Interface – Burney Collection

25
OCR quality can be very poor!

26
1268 Folders

27
burney_summary.xls

28
Breakdown of titles
Title No. of Pages
PUBLIC ADVERTISER 60680
LONDON GAZETTE 44463
LONDON EVENING POST 38920
LONDON CHRONICLE 32030
GAZETTEER AND NEW DAILY ADVERTISER 31250
LLOYD'S EVENING POST 28941
ST. JAMES'S CHRONICLE OR THE BRITISH EVENING POST 28130
MORNING CHRONICLE AND LONDON ADVERTISER 27658
DAILY COURANT 25334
GENERAL EVENING POST 23500
12 TITLES WITH 10,000+ PAGES 188266
87 TITLES WITH 1,000+ PAGES 289745
216 TITLES WITH 100+ PAGES 79374
945 TITLES WITH 1 TO 100 PAGES 16816

29
Example Folders
B0001ORIWEEJO - APPLEBEE'S ORIGINAL WEEKLY JOURNAL - 1715 – 1720
B0018CONTPROC - PROCEEDINGS OF THE ARMY UNDER THE COMMAND OF SIR
THOMAS FAIRFAX – 1645
B0054REPINFCH - REPORT OF THE STATE OF THE GENERAL INFIRMARY AT
CHESTOR - 1754?-1779
B0101PROCPARL - EXACT RELATION OF THE PROCEEDINGS AND TRANSACTIONS
OF THE LATE PARLIAMENT – 1654
B0277INSTRUCT - INSTRUCTOR – 1724
B1381SCOU1717 - SCOURGE (1717, REPRINT) - 1717?

30
Example files
‘service’ folder contains page level images and corresponding OCR XML
BurneyB0001ORIWEEJO17151119service

31
APPLEBEE'S ORIGINAL WEEKLY JOURNAL
FROM SATURDAY NOVEMBER 19 TO SATURDAY NOVEMBER 26 1715
WO2_B0001ORIWEEJO_1715_11_19-0001.tiff
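The file name above appears to encode the title folder, the issue date and the page number. A minimal sketch of pulling those parts out follows; the pattern is inferred from this single example, so it is an assumption and other files in the collection may not match it. The function name is hypothetical.

```python
import re
from datetime import date

# Pattern inferred from one example file name; other files may differ.
PAGE_FILE = re.compile(
    r"^(?P<prefix>[A-Z0-9]+)_(?P<title_code>B\d{4}[A-Z0-9]+)_"
    r"(?P<year>\d{4})_(?P<month>\d{2})_(?P<day>\d{2})-(?P<page>\d{4})\.tiff$"
)

def parse_page_filename(name):
    """Split a page image file name into its parts, or return None if it does not match."""
    m = PAGE_FILE.match(name)
    if m is None:
        return None
    return {
        "prefix": m["prefix"],
        "title_code": m["title_code"],
        "issue_date": date(int(m["year"]), int(m["month"]), int(m["day"])),
        "page": int(m["page"]),
    }

info = parse_page_filename("WO2_B0001ORIWEEJO_1715_11_19-0001.tiff")
print(info)
```

Returning `None` for non-matching names makes it easy to list the files that break the assumed pattern, which is itself a useful finding about the collection.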

32
JISC 1 and JISC 2
Newspapers

33
Accessing digitised newspapers
through Gale Interface (subscription)

34
Private BL NAS
Accessible onsite or remotely if security cleared via CITRIX

35
onsite at the BL (JISC 1)
12 Volumes, 80TB of data

36
onsite at the BL
Accessing ‘service’ Copy (post processed)
and results of OCR available as XML

37
onsite at the BL
Accessing ‘service’
Copy (post processed)

38
onsite at the BL
Accessing OCR as XML

39
jisc_1.xls
79 Titles, 2 million pages

40
Metadata from BL (JISC 1 and 2)
• Title Metadata
– Title, as written
– Normalised title across all variants
– Standardised title abbreviation
– Variant titles, with associated dates
– Place of publication
– Dates of publication
– Genre, such as newspaper
– Sub-collection, such as Regional Daily
• Issue Metadata
– Volume Number
– Issue Number
– Date as printed
– Normalised date (YYYY.MM.DD)
– Number of pages
– The microfilm reel number
– The OCR quality
• Page image data
– The number of the image within that issue
– The filename
– The spatial coordinates for the page within the image
– The degree of page skew

41
Metadata from Gale (JISC 1 and 2)
• Standardised identifier
• Newspaper title
• Standardised title abbreviation
• Project codes
• Digitized collection name
• Issue number
• Date as printed
• Standardised date (Month, DD, YYYY)
• Standardised date (YYYYMMDD)
• Day of the week
• Number of Pages
• Copyright holder
• Language
• Unique ID for publication
• Holding Library
• Citation of the physical item
• Title metadata
– Title as recorded in the MARC Library Catalogue
– Dates of publication
– Genre, such as newspaper
– Conversion credit, usually a vendor
• Article
– Unique ID
– OCR quality
– SC, or standardized category of article
– Unique ID(s) of page(s)
– Unique ID(s) of individual column(s)
– Column number
– Headline
– Article type
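The BL metadata records a normalised date as YYYY.MM.DD while the Gale metadata records a standardised date as YYYYMMDD, so comparing the two sources means parsing both into one form. A minimal sketch, assuming the values follow those two patterns exactly; the function names and the example values are made up for illustration.

```python
from datetime import date, datetime

def parse_bl_date(value):
    """Parse a BL-style normalised date such as '1871.11.14'."""
    return datetime.strptime(value, "%Y.%m.%d").date()

def parse_gale_date(value):
    """Parse a Gale-style standardised date such as '18711114'."""
    return datetime.strptime(value, "%Y%m%d").date()

# Illustrative values only, written in the two stated formats.
bl = parse_bl_date("1871.11.14")
gale = parse_gale_date("18711114")
print(bl == gale, bl.isoformat())
```

The ‘Month, DD, YYYY’ form is left out here because its exact punctuation is not shown in the slides.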

42
Samples for JISC 1
‘master’ contains high res tiff
‘service’ contains post processed tiff and OCR XML
BNWL - The Belfast News-Letter - 1871 - November 14
BNWL - The Belfast News-Letter - 1885 - September 12
DNLN - Daily News - 21 Jan 1846 - 31 Dec 1900

43
JISC 2 Collection
• 22 Titles
• Regional titles
• 1020550 pages

44
jisc_2.xls

45
JISC 2
• 40 TB
• Stored differently locally
192,353 folders

46
Samples for JISC 2
• Organised differently

47
Samples for JISC 2
Lancaster Gazetter, And General Advertiser For Lancashire West
Southampton Herald
Berrows Worcester Journal
A - Contains post processed files
M - Contains JP2
O - Contains ALTO XML
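ALTO XML stores recognised words as `String` elements with a `CONTENT` attribute, grouped into `TextLine` elements. The sketch below rebuilds plain text lines from that structure. The sample fragment is invented, real files carry a version-specific namespace and many more attributes, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented fragment in the general shape of ALTO, for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Worcester"/><SP/><String CONTENT="Journal"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Thursday"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

def local_name(tag):
    """Strip any '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]

def alto_lines(xml_text):
    """Return one string per TextLine, joining its String CONTENT values with spaces."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [child.get("CONTENT", "") for child in element
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines

print(alto_lines(SAMPLE_ALTO))
```

Matching on the local tag name rather than a fixed namespace is a deliberate choice, since the ALTO version used for a given collection is not stated here.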

48
Previous ideas of using collection
• Bob Nicholson – Finding jokes
• Katrina Navickas – Political meetings
• Hannah Murray – Black abolitionist performances
• Jennifer Batt – Finding poetry
• Surendra Singh – Finding suicide articles
• Melodee Beals – Evidence of copy and paste
• Ryan Cordell – Viral Texts
• Paul Fyfe - Snipping out images

49
Useful resources
• http://oceanicexchanges.org/
• http://scissorsandpaste.net/
• http://viraltexts.org/
• https://repository.lib.ncsu.edu/bitstream/handle/1840.20/33457/fyfe.newspaper.archaeology.VPR.pdf?sequence=1

50
Use of Overproof
OCR Correction?
Re-OCR with
ABBYY FineReader?
https://www.abbyy.com/en-gb/
http://overproof.projectcomputing.com/
RE-OCR

51
Virtual Infrastructure for OCR text
OCR text ‘scraped’ from
digitised newspapers
and put in cloud
Jupyter notebook
Write Python code and see results
in web browser
http://jupyter.org
Access available for researchers ‘in residence’
https://www.docker.com/

52
65,000 digitised 19th Century books
Image: Artwork by Alicia Martin 2007 / 2008
Paid for by:
For a full list:
https://goo.gl/HqPQMS
Subjects include:
Philosophy
Poetry
History
Literature
1789 - 1876

53
Working with the MS Books Collection
• Metadata
• Page level images
• OCR Text
• Flickr Commons - images snipped out and user generated tags for images
• 19th Century Books Collection data

54
30 August 2012

55
Metadata
MicrosoftBooks.xls - Over 65,000 titles

56
MS Books – Finnish Titles

57
Fiction / Non Fiction

58
Latin American Studies

59
ALTO XML – Sample Files – 1800 - 1809
1502 Zip Files
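With this many zip files it helps to inspect what each one contains before unpacking anything. A minimal sketch using the standard library follows; the member names in the in-memory sample are invented, and the real archives may be laid out differently.

```python
import io
import zipfile

def list_xml_members(zip_source):
    """Return the sorted names of .xml members in a zip file (path or file object)."""
    with zipfile.ZipFile(zip_source) as archive:
        return sorted(name for name in archive.namelist()
                      if name.lower().endswith(".xml"))

# Build a small in-memory zip so the sketch runs without any download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("000000001_000001.xml", "<alto/>")
    archive.writestr("000000001_metadata.txt", "placeholder")
buffer.seek(0)

names = list_xml_members(buffer)
print(names)
```

For a downloaded file, the path to the zip can be passed in place of the in-memory buffer.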

60
OCR Text – JSON File

61
002819694

62

63

64
Optical Character Recognition (OCR)
generated Text
Scanned Page
Image on Flickr
Commons
https://goo.gl/AC43vs

65
Worked better for female faces than men’s
Press
http://mechanicalcurator.tumblr.com
Posts image every 30 minutes
http://www.flickr.com/photos/britishlibrary/
1,020,418 images
need tagging!
Creative uses of images
Face recognition
Algorithms based on photos
Mechanical Curator
with an algorithmic brain
(Circles, Squares and Slanty etc)
http://goo.gl/qPPgxX
Wikimedia
Flickr Commons
Individual URL & API
Snipping out images
from 65,000 Digitised Books*
>1,000,000,000* views
>17,000,000* tags
https://goo.gl/FgZ4HM
Work @ BL by Ben O’Steen, Labs
and Digital Research Team*
Matt Prior - http://goo.gl/j29Tnx
Since Dec 2013
Tumblr
*Estimates
>More demand to see
physical items

66
British Library Flickr Commons
https://www.flickr.com/photos/britishlibrary/
Flickr Commons has items from
Galleries, Libraries, Archives and Museums (GLAM)
(Mostly Public Domain)

67
Flickr Commons (100 + GLAMs as of 25/09/18)

68
Getting an account on Flickr
•Get a Flickr / Yahoo account
(https://login.yahoo.com/account/create)
•You can then tag, organise favourites, make your own
albums and galleries from Flickr images online or uploaded
•You get 1TB for free!

69
British Library Flickr Commons
Why Flickr Commons?
• Free!
• Each image has its own unique web address, easy to share
• Can Tag images
• Has Application Programming Interface (API)
Late August 2013

70
Using British Library Flickr Commons
•How do we find things in this collection?
•Remember snipped out images from books with no
description?
•Not straightforward…

71
How is Flickr Commons Organised?
• Photostream
• Albums
• Faves
• Galleries
• Tags

72
Flickr Photostream
https://www.flickr.com/photos/britishlibrary/
Kind of the home page for the collection!
Usually displays images with most recent activity!

73
Flickr Albums
Curated by the British Library – specifically Nora McGregor
She works with the public to add images or create new ones!
Over 450 Albums as of 25/09/18 – Mostly Maps!

74
Flickr Faves
Most favourited images first, in descending order
To favourite an image requires an account

75
Flickr Galleries
More useful if you have an account
You can create a Gallery of Flickr images to share with everyone
Gallery is tied to your account

76
Flickr Groups
Community based – for sharing and discussing images
We might create a group for the competition – watch this space!

77
Adding Tags in Flickr
Be the next ‘Chico45’!

78
Get Tags!

79
Searching within the collection!

80
The Anatomy of a BL Flickr Record
Download
high res
300dpi image

81

82
When you log in to Flickr Commons

83

84
Opportunities
– increasing traffic to Library services
You can purchase
a ‘High Res’ Copy
View in the
Library Item Viewer
Download .pdf
All illustrations
in book
Other illustrations in books
Published in same year
View the item in
the Library Catalogue
Tags auto generated
User generated
Tag
Grouping for image

85
Refers to the
Physical Copy of
the Item

86

87

88
Physical and Digital Copy
Number relates to Physical Copy

89

90

91

92

93

94

95

96
You can’t beat the Physical Copy!

97
Now for the Digital Copy!

98

99

100

101
Warning – can be large file!
It’s a PDF
You can do Ctrl F in it to find text
But health warning about OCR!

102

103
Page numbers don’t always correspond!
Page numbers
Don’t always correspond
Page 132 on Flickr?
Is Page Number in PDF
In PDF of
book
Page number
in book

104

105
Plain Text from Books?
Not working
But can be obtained from https://data.bl.uk/digbks/db14.html

106
All illustrations in book / books in same year!
All the illustrations in this book
Other illustrations in books published in the same year

107
Views and Favourites

108
Galleries
•Personal Galleries which you can share.

109
Exchangeable Image File Information!
For Geeks only!

110
Tags!

111
Tagging a million images
Iterative Crowdsourcing
http://goo.gl/j6fxac
Cardiff University’s
Lost Visions Project
http://www.metadatagames.org/
Metadata Games
James Heald
Mario Klingemann
Chico 45
Use computational methods
Human Tagger
Top British Library Flickr Commons Taggers
18 hard core taggers
How to reward this ‘small group’ and keep it motivated?
Average for ‘crowd’ is 1 tag per person
What kind of ‘task’ can this ‘crowd’ do?
Mobile games for ‘Ships’, ‘Covers’ and ‘Portraits’ Interface for tagging

112
Adding Tags!
•You have to have an account to add tags!
•Could you be the next Chico 45?

113
Generated from book
Description
Generated from user

114
Generated by Flickr

115
Flickr Commons API
https://www.flickr.com/services/api/
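The Flickr API is called over HTTP, with the method name and its arguments passed as query parameters. The sketch below only builds a `flickr.photos.search` request URL and does not send it. `YOUR_API_KEY` and `ACCOUNT_NSID` are placeholders: an API key has to be requested from Flickr, and the account identifier has to be looked up for the account being searched. The function name is hypothetical.

```python
from urllib.parse import urlencode

FLICKR_REST = "https://api.flickr.com/services/rest/"

def photo_search_url(api_key, user_id, tags, per_page=10):
    """Build a flickr.photos.search URL for one account, filtered by tag."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "user_id": user_id,
        "tags": tags,
        "per_page": per_page,
        "format": "json",
        "nojsoncallback": 1,  # ask for plain JSON rather than a JSONP wrapper
    }
    return FLICKR_REST + "?" + urlencode(params)

# Placeholder values; replace them before sending the request.
url = photo_search_url("YOUR_API_KEY", "ACCOUNT_NSID", "map")
print(url)
```

Fetching the URL with a real key returns a JSON document describing the matching photos, which can then be parsed with the `json` module.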

116
Generated by SherlockNet!
bit.ly/sherlocknet

117
Sherlocknet has a search interface!

118
SherlockNet Search for ‘people’

119
Advanced Search in SherlockNet!
Tags Available for Download

120
19th Century Books Metadata
• 1.9 million records of 19th Century Books
• Used for Sample generator project

121
Using the Wikimedia Synoptic Index
• Created to help find all the maps in the books
• Great resource if you want to find things by place!
https://goo.gl/zuxRnG

122
Google Fusion Table
• https://fusiontables.google.com/DataSource?docid=1BMm0FeSsEBa40zgs3C3vySKC0gnPk-pSvrDqqnA7&pli=1#rows:id=1

123
Geodata
flickr_geodata.csv

124
Alston Index
Internal Document
55-602 - Topical Index
603 - 925 - Pressmark Sequence
925 page document of BL / British Museum Pressmarks

125
Alston Index
• Internal document (not to be externally shared)
• Published in 1987 – dot matrix printed
• Refers to British Museum and British Library Pressmarks / Shelfmarks
• Shelfmarks are used internally to identify

126
Topical Index
OCR problems – Re-do? Manually correct?

127
Augment Library Catalogue?

128
Libcrowds – In the Spotlight
https://www.libcrowds.com/collection/playbills/projects

129
Libcrowds – Spotlight - Data
https://www.libcrowds.com/collection/playbills/data

130
Data Journey
• Choose one or two datasets maximum
• Explore the collection and make notes about any challenges and issues
• See if you can curate a smaller collection from the larger collection
• Tell us what you have done
• We will consider publishing it on http://data.bl.uk
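Curating a smaller collection from a larger one can be as simple as filtering a metadata file and saving the rows that match. A minimal sketch follows, assuming hypothetical column names (`identifier`, `title`, `date`) and invented sample rows; the selection rule, here a date range, would be whatever defines the derived data-set.

```python
import csv
import io

# Invented sample standing in for a larger metadata file.
SAMPLE_CSV = """identifier,title,date
001,A Tour in Wales,1801
002,Poems on Several Occasions,1812
003,Maps of Adelaide,1839
"""

def derive_subset(csv_text, start_year, end_year):
    """Return (csv_text_of_subset, rows_kept) for rows dated within the range."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    kept = 0
    for row in reader:
        # Skip rows whose date is not a plain year, e.g. '1754?'.
        if row["date"].isdigit() and start_year <= int(row["date"]) <= end_year:
            writer.writerow(row)
            kept += 1
    return out.getvalue(), kept

subset_csv, kept = derive_subset(SAMPLE_CSV, 1800, 1809)
print(kept)
print(subset_csv)
```

Rows with uncertain dates are dropped rather than guessed at; noting how many were skipped is part of documenting the challenges and issues of the collection.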
