This document summarizes the work of the British Library Labs over the last four years. It discusses how the Labs works with researchers, artists, librarians and others to experiment with the British Library's digital collections and datasets using techniques like text mining, image analysis, and crowdsourcing. It provides examples of projects that have unlocked hidden histories in messy textual data, mapped political meetings in newspapers, and identified trends in suicide reporting. The goal is to make the Library's intellectual heritage more accessible and to learn how to better support digital research.
British Library Labs Roadshow - Open University
1. 1
@mahendra_mahey @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/9giuQW
British Library Labs
What is British Library Labs and what have we learned over
the last four years?
1330 – 1430 and 1610-1630, 3 May 2017
Learning the Lessons of working with the British Library’s Digital Content and Data for your research
British Library data and collections and discussions and feedback on ideas, challenges and issues
Open University, Milton Keynes
https://goo.gl/9giuQW
Mahendra Mahey, Manager of British Library Labs
@BL_Labs and @mahendra_mahey
mahendra.mahey@bl.uk
2. 4
The British Library
Inside the British Library
Space for 1200 readers, around 400,000 visitors per year
Uses low oxygen and robots
Reading room and delivery to London
Document Supply and Storage at Boston Spa
Stockton-on-Tees
Authors’ right to payment each time their books
are borrowed from public libraries.
St Pancras, London, UK
Many books are stored 4 stories below the building
Legal Deposit Library – Reference only
3. 5
Living Knowledge Vision (2015 – 2023)
Custodianship Research Business
Culture Learning International
To make our intellectual heritage accessible to everyone,
for research, inspiration and enjoyment and be the most open, creative
and innovative institution of its kind by 2023.
Document:http://goo.gl/h41wW7 Speech:https://goo.gl/Py9uHK
Roly Keating (Chief Executive Officer of the British Library)
4. 6
Collections – not just books!
> 180*million items
> 0.8* m serial titles
> 8* m stamps
> 14* m books
> 3* m sound recordings
> 4* m maps
> 1.6* m musical scores
> 0.3* m manuscripts
> 60* m patents
King’s Library *Estimates
5. 7
http://www.bl.uk/projects/british-library-labs
Funded by the Andrew W. Mellon Foundation
8. 10
Digital research methods
Visualisations
Application Programming Interfaces
for datasets e.g. Metadata, Images
Transcribing
Annotation
Location based searching & Geo-tagging
Corpus analysis, Text Mining &
Natural Language Processing
Crowdsourcing
Human Computation
10. 12
Competition
Awards
Projects
Tell us your ideas of what to do with our digital content
Show us what you have already done with our digital
content in research, artistic, commercial and learning and
teaching categories
Talk to us about working on collaborative projects
12. 14
Why are we doing this?
• Working closely with and listening to those who want to use
our digital collections and data for their work
• We can learn how we are and should be supporting them:
– Access to digital collections?
– Advice, guidance, technical support, training
– Services, Tools and Processes?
– Many more reasons…
• Where are the gaps between what users want and what we
can give?
• How do we build the bridges to overcome the gaps?
13. 15
Born digital
Data all around us!
Knowledge Quarter London
55 knowledge organisations within 1 mile radius of
Kings Cross, http://www.knowledgequarter.london
http://www.turing.ac.uk (Headquartered at the British Library)
UK Web Archive and e-legal deposit
http://www.webarchive.org.uk/ukwa/
https://goo.gl/pGO7QY
15. 17
Playbills, Books, Newspapers
(includes OCR)
Digital collections and Datasets
British National
Bibliography
http://bnb.data.bl.uk
http://sounds.bl.uk
http://dml.city.ac.uk/
Music (Recordings & Sheet) & Sounds
http://goo.gl/frSMJt
Broadcast News (TV and Radio)
http://goo.gl/cwThHw
http://goo.gl/pBkisZ
http://goo.gl/E8aRyQ Usage data
EThOS
Web Archive
Images, Manuscripts & Maps
http://www.qdl.qa/
Qatar Digital Library
http://idp.bl.uk/
International
Dunhuang
Project
Maps
http://www.bl.uk/maps/
Hebrew Manuscripts
http://goo.gl/4sbCp9
Flickr &
Wikimedia Commons
https://goo.gl/LZRmaZ
16. 18
Open Cultural Heritage Datasets
Collection Guides
Datasets about our collections
Bibliographic datasets relating to our published and
archival holdings
Datasets for content mining
Content suitable for use in text and data mining
research
Datasets for image analysis
Image collections suitable for large-scale image-
analysis-based research
Datasets from UK Web Archive
Data and API services available for accessing UK Web
Archive
Digital mapping
Geospatial data, cartographic applications, digital aerial
photography and scanned historic map materials
https://data.bl.uk
Discussion list: http://www.jiscmail.ac.uk/CULTURAL-HERITAGE-DATASETS
18. 20
Typical pattern of research for Labs
•Finding invisible things in ‘messy’ historical
data
•Unearthing / unlocking hidden histories and
data to stimulate new research
•Celebrating hidden histories / data creatively
through events, art and performance
19. 21
Finding things in messy OCR text
Mrs Folly
• Clean up some manually
• Get human ‘ground truth’
• Write code to find things
reliably in it automatically
• Try code on messy content
• Tweak if necessary
• Digital ‘lasso’ around content
• Human sift through
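The "write code to find things" and "digital lasso" steps can be sketched with approximate string matching, so that OCR misreadings of a search term are still caught and passed to a human for sifting. This is a minimal sketch using Python's standard `difflib`, not the code Labs actually used; the function name `find_fuzzy`, the 0.8 threshold and the sample text are assumptions made for illustration.

```python
import difflib
import re


def find_fuzzy(term, text, threshold=0.8):
    """Return (line_number, matched_text, score) for approximate hits of term.

    Hypothetical helper: the threshold would be tuned against a
    human-checked 'ground truth' sample.
    """
    n_words = len(term.split())
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        words = re.findall(r"\S+", line)
        # Slide a window with the same word count as the search term.
        for i in range(len(words) - n_words + 1):
            candidate = " ".join(words[i:i + n_words])
            score = difflib.SequenceMatcher(
                None, term.lower(), candidate.lower()
            ).ratio()
            if score >= threshold:
                hits.append((line_no, candidate, round(score, 2)))
    return hits


# Invented sample with a typical OCR confusion ("l" read as "1").
sample = "The part of Mrs Fol1y was played\nby Mr Smith to great applause"
hits = find_fuzzy("Mrs Folly", sample)
for line_no, text, score in hits:
    print(line_no, text, score)
```

Lowering the threshold widens the lasso (more hits for humans to sift); raising it narrows it.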
20. 22
Code: Machine Learning / Reading
• Analogies to how humans read / learn
• Machines acquire ‘knowledge’ / data and use that knowledge
/ data to make sense / identify patterns
• Labs doing this on a case by case basis so methods can vary
• Need computational AND human effort
• Legalities of this process being ‘ironed out’ with publishers
• Often a misunderstood area…
• Computers look for ‘patterns’ or the ‘essence’ of something
21. 23
Smell of soup & Machine Learning
Thanks to Memo Akten (@memotv on twitter) for the inspiration!
https://goo.gl/toq4Bo
Nasreddin, 13th Century Turkish Sufi
http://web2.uvcs.uvic.ca/elc/studyzone/330/reading/smell1.htm
22. 24
Victorian Meme Machine (2014)
https://goo.gl/HMqDt3
Bob Nicholson
http://victorianhumour.tumblr.com/
Bob Nicholson interviewed on
BBC Radio 4 Making History Programme:
http://goo.gl/fmV9ep
And telling jokes to the public:
http://goo.gl/xIDRhz
Bob obtained further funding from his university
Looking for more collaborations
https://www.youtube.com/watch?v=-GRgj7Q5OM0
Rob Walker, Victorian Mother-in-law Jokes
Victorian Comedy Night, 7 Nov 2016
23. 25
Katrina Navickas (2015)
Political Meetings Mapper
http://politicalmeetingsmapper.co.uk
https://goo.gl/Qq78Oa
Labs Symposium 2015
https://goo.gl/BSA3be
Interview 2015
The Chartist Newspaper
http://goo.gl/vOLSnH
Chartist Monster Meeting
Chartists Walking Tour and
Re-enactment London
24. 26
Black Abolitionist Performances & their
Presence in Britain (2016) – Hannah-Rose Murray
Frederick
Douglass
Ellen
Craft
Josiah
Henson
Ida B
Wells
A Performance by
Joe Williams &
Martelle Edinborough
http://frederickdouglassinbritain.com/
25. 27
Data-mining verse in 18th Century newspapers
BL Labs Project 16-17, Jennifer Batt
https://goo.gl/5Akthd
Slides courtesy Jennifer Batt
Jennifer Batt @ the BL on World Poetry Day
26. 28
What thoj' among ourrelves, with too much Heat, or t
W: fweutimes.wongle, wvhen we Ihould debate, W –
(A confequential Ill which Freedom drawvs, fl t
A bad Efficf, but from a noble Caufe) t
We can with univeifal Zcal advance, to
To cutb the faithlefs Arrogancccof V rance. hi
Dublin Journal
10-14 September,
1745
Slides courtesy Jennifer Batt
27. 29
Verse: 81% lines begin
with initial capital
Prose: 52% lines begin
with initial capital
Westminster Journal 3
March 1745
Slides courtesy Jennifer Batt
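The initial-capital figures suggest one simple feature for separating verse from prose in OCR text. The sketch below computes that feature and applies a cut-off; it is an illustration only, not the project's actual method. The function names, the 0.7 threshold (chosen to sit between the 52% and 81% figures) and the prose sample are assumptions; the verse sample is the Dublin Journal OCR shown on the earlier slide.

```python
def initial_capital_ratio(lines):
    """Fraction of non-empty lines whose first character is an uppercase letter."""
    non_empty = [ln.strip() for ln in lines if ln.strip()]
    if not non_empty:
        return 0.0
    capitalised = sum(1 for ln in non_empty if ln[0].isupper())
    return capitalised / len(non_empty)


def looks_like_verse(lines, threshold=0.7):
    """Hypothetical rule: treat a block as verse if enough lines start with a capital."""
    return initial_capital_ratio(lines) >= threshold


# OCR text of the verse from the Dublin Journal slide, errors left as they are.
verse = [
    "What thoj' among ourrelves, with too much Heat, or t",
    "W: fweutimes.wongle, wvhen we Ihould debate, W –",
    "(A confequential Ill which Freedom drawvs, fl t",
    "A bad Efficf, but from a noble Caufe) t",
    "We can with univeifal Zcal advance, to",
    "To cutb the faithlefs Arrogancccof V rance. hi",
]

# Invented prose sample, for contrast only.
prose = [
    "the ministers met yesterday at Whitehall and",
    "resolved that the fleet should sail. It is",
    "said that the Dutch have agreed to the",
    "Terms proposed by his Majesty.",
]

print(initial_capital_ratio(verse), looks_like_verse(verse))
print(initial_capital_ratio(prose), looks_like_verse(prose))
```

A single feature like this would only draw the first rough lasso; a human would still need to sift the candidates.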
28. 30
Psychiatrist’s Journey into 19th
Century Newspapers
• Dr Surendra P Singh, Consultant Psychiatrist, Black
Country Partnership NHS Foundation Trust, Hon
Reader in Mental Health, University of
Wolverhampton
• To identify weekly, monthly, yearly and longitudinal
trends in suicide reporting in terms of gender, status,
sites, locations and health in OCR text of 19th
Century Newspapers
• Used ‘R’ Open Source Stats Package to collect
‘Suicide’ corpus
• Looking for collaborators to work on this dataset
29. 31
Use of Overproof / OCR Correction?
Re-OCR with
ABBYY FineReader?
https://www.abbyy.com/en-gb/
http://overproof.projectcomputing.com/
30. 32
Virtual Infrastructure for OCR text
OCR text scraped from
digitised newspapers
and in cloud
Jupyter notebook
Write Python code and view results
in browser
http://jupyter.org
Access available for researchers ‘in residence’
32. 34
Worked better for female faces than men’s
Press
http://mechanicalcurator.tumblr.com
Posts image every 30 minutes
http://www.flickr.com/photos/britishlibrary/
1,020,418 images
need tagging!
Creative uses of images
Face recognition
Mechanical Curator
http://goo.gl/qPPgxX
Flickr
Snipping out images
from 65,000 Digitised Books*
>600,000,000 views
>20,000,000 tags
https://goo.gl/FgZ4HM
Work @ BL by Ben O’Steen, Labs
and Digital Research Team
*Matt Prior - http://goo.gl/j29Tnx
Since Dec 2013
Tumblr
33. 35
Using other platforms to host BL collections
Links back to Library & community engagement
You can purchase
a ‘High Res’ Copy
View in the
Library Item Viewer
Download .pdf
All illustrations
in book
Other illustrations in books
Published in same year
View the item in
the Library Catalogue
Tags auto generated
User generated
Tag
Grouping for image
Same on Wikimedia commons
British Library Flickr Commons Tags
35. 37
Tagging a million images
Iterative Crowdsourcing
http://goo.gl/j6fxac
Cardiff University’s
Lost Visions Project
http://www.metadatagames.org/
Metadata Games
James Heald
Mario Klingemann
Chico 45
Use computational methods
Human Tagger
Top British Library Flickr Commons Taggers
36. 38
Special Jury’s Prize (2015)
James Heald – Wikimedia and Map work
https://goo.gl/WYZCB2
http://goo.gl/HNQq5e
https://goo.gl/VPgffL
https://commons.wikimedia.org/
https://goo.gl/djtm1b
Labs Symposium (2015)
Geotagging maps
54,000 Maps
Found in Flickr 1 million
Human & Computational
Tagging
& Community engagement
37. 39
Adam Crymble (2015)
Crowdsource Arcade
http://goo.gl/LBfJ4W
http://goo.gl/OH9pOZ
https://goo.gl/7z0j8p
30 mins talk
Labs Symposium (2015)
https://goo.gl/SSRsdd
5 min interview (2015)
http://goo.gl/0APpE8
Game Jam
Using Arcade Games
to help Tag images
38. 40
SherlockNet: Competition Winner 2016
Karen Wang, Luda Zhao and Brian Do
Using Convolutional Neural Networks to Automatically Tag and Caption
the British Library Flickr Commons 1 million Image Collection
12 categories
>20 million tags added
>100,000 captions
bit.ly/sherlocknet
Pooled surrounding
OCR text on page
from similar images
Used Microsoft COCO (photographs) &
British Museum Prints and Drawings
collections as training sets.
Tags Captions
39. 41
Artistic / Creative Works
http://goo.gl/dM8ieA
Mario Klingemann (2015)
https://www.youtube.com/watch?v=Q3SBxO34Zlc
David Normal 2014 and 2015
http://goo.gl/bNxGZZ
Kris Hoffman (2016)
https://goo.gl/QilqqT
Jiayi Chong 2016
https://www.facebook.com/RealmlandStory/
Paul Rand Pierce 2016
A Hat on the Ground Spells trouble
Tragic Looking Women
44 Men who Look 44
(Notice the direction faces)
40. 42
Mario Klingemann 2016
https://www.youtube.com/watch?v=xgnxnmqnR7Y
Google Arts and Culture Lab – Experiments with Machine Learning
https://artsexperiments.withgoogle.com/
41. 43
Imaginary Cities – BL Labs Project 16-17
Michael Takeo Magruder
https://goo.gl/4ARwTy
An artistic exploration seeking to create provocative fictional cityscapes for the Information Age
from the British Library’s digital collection of historic urban maps
43. 45
Have you got X?
https://upload.wikimedia.org/wikipedia/commons/5/50/Real_wuerzburg.jpg
Looking for Physical Content in the British Library
44. 46
Have you got X digitised?
http://www.yorkmix.com/wp-content/uploads/2014/04/mr-simms-sweet-shoppe-york.jpg
Looking for Digitised Content in the BL
45. 47
So little digitised?
• Digitisation costs time and resources…
• Still…over 650 Digital Collections but not all found through Google or
even online
• Dialogue is either:
– you are ‘lucky’ and we have the digital content relevant to your
research
– we don’t have exactly what you’re looking for, but is there anything of
interest? Let’s talk…
• Artists find this dialogue easier and we tend to attract researchers with
‘fuzzier’ research boundaries
• Access easier for openly licensed content
• More challenging for on-site and in-copyright contemporary material
47. 49
The Story of the Digital Collection…
Digital
Collection
Curator
Who paid for the digitisation?
Who did the digitisation?
Technology used
Born digital?
Published
Unpublished
Where is it?
Can it still be accessed?
Generates income
Reputational Risk
Legalities
Political
Ego Surprises
Metadata
Old format not supported
What media was the
digitisation done from?
Documentation
No Metadata
Messy Metadata
Still there?
Good to know the background of a
Digital collection if you want to use it for research…
48. 50
Open Licensed Digital Content?
Around 10%* available online
Working through
Breakdown by collection*
Manuscripts 59%
Books 9%
Maps and Views 7%
Newspapers 3%
Archives and Records 3%
Paintings, Prints and Drawings 2%
*Based on digitisation projects
Largest proportion of funding
Public / Private Partnership
15%* Openly Licensed
85%* Available onsite
*Estimates
49. 51
How do we give access to
onsite-only
Digital Collections
(85% of our Digital Collections)?
50. 52
READING
ROOM
ON
SITE
NOT
ONLINE
OPEN
British Library
£
Labs Residency Model
Challenges of access to Digital Collections
51. 53
Accessing digital collections onsite
OPEN
£
• Have to be ‘onsite’
• Need to be security cleared for some collections
– Hence ‘Researcher in Residence Model’
• Permission required (depending on ‘story’ of collection)
• Content on various media formats
• 5-20 % re-use of material for non commercial research for
some collections
• We are learning ‘pathways’ so that this becomes ‘everyday’ to
provide onsite access in the future
52. 54
Lessons Learned & Challenges…(1)
• Start with a conversation (external and internal), our data isn’t all on
Google (yet!) & not easy to find, need to create and embrace serendipity
and opportunities for use by talking!
• Need to have several conversations with several stakeholders and tap
into their tacit knowledge, which isn’t always written down, to
progress ideas.
• Often misunderstandings because of jargon & different meaning of
words.
• Learn the story of the collection
• Expectations change when researchers actually see the data, systems
and experience the ‘culture’ of the organisation.
• Opening collections requires some people to let go of their emotional and
psychological connection to them
53. 55
Lessons Learned & Challenges…(2)
• Embrace dirty data, it may never be perfect!
• We tend to work with researchers who can be ‘flexible’ with their research
questions and are willing to embrace challenges.
• Many researchers have the domain knowledge but lack the technical /
digital skills to use Digital Research methods. Should they be teamed up
with those that want to solve problems (computer science) or get trained?
• Identifying / bridging gaps for researchers to use data, help them ‘navigate’
through the Library to get the data they want (sometimes).
• Huge appetite to use digital content & data (e.g. Flickr Commons stats).
• Stimulate the imagination, work fast, give it energy
54. 56
Labs mindset…
1. Start a conversation and try to support ideas
2. Start with small experiments, but think big!
3. Fail faster (don’t be afraid)
4. Reject perfectionism
5. Good enough is sometimes Good enough
6. Celebrate the uses of collections
55. 57
The Magic of Openness!
•If digitised / digital collections are not used, what is
the point of digitising / keeping them?
•Opening up our digital collections offers new ways
for the Library’s content to be remixed and re-
imagined
•Opening up our digital collections ‘re-energises’
them and the Library
•Generates plenty of examples to inspire use by
others
57. 59
The Future of BL Labs
•Continue to engage with researchers, learn what
they want to do and collect evidence of demand
•Develop Business Model and Support process to
make ‘Business as Usual’ at the British Library
•Help to create pathway to developing a Digital
Research Suite at the British Library
65. 67
Accessing digitised newspapers
onsite at the BL
1
Windows 7
External access possible through Citrix Server
Results of digitisation exist on Windows file shares!
66. 68
Accessing digitised newspapers
onsite at the BL (JISC 1)
2
12 Volumes, each with terabytes of data
77. 79
Accessing digitised newspapers
onsite at the BL
13
Accessing original ‘master’ image (not
cropped or post processed)
Or ‘service’ copy (post processed)
and results of OCR available as ALTO XML
78. 80
Accessing digitised newspapers
onsite at the BL
14a
Accessing original ‘master’ image
(not cropped or post processed) in .TIFF format
79. 81
Accessing digitised newspapers
onsite at the BL
Accessing original
‘master’ image
(not cropped or post
processed)
14b
80. 82
Accessing digitised newspapers
onsite at the BL
15a
Accessing ‘service’ Copy (post processed)
and results of OCR available as ALTO XML
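Since the OCR results are delivered as ALTO XML, a researcher's first step is usually to pull the plain text back out. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`, relying on the ALTO convention that each recognised word is a `String` element with a `CONTENT` attribute inside a `TextLine`. The sample document is invented, and the ALTO version of the Library's files is an assumption, so the code compares local tag names rather than a fixed namespace.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per TextLine, joining the CONTENT of its String elements."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            words = [
                child.get("CONTENT", "")
                for child in elem
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


# Invented two-line ALTO fragment; WC is ALTO's per-word confidence attribute.
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Chartist" WC="0.91"/><SP/><String CONTENT="Meeting" WC="0.64"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Kennington" WC="0.88"/><SP/><String CONTENT="Common" WC="0.79"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

extracted = alto_lines(sample)
print(extracted)
```

Keeping the line structure, rather than flattening the page to one string, preserves layout cues that later analysis may need.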
81. 83
Accessing digitised newspapers
onsite at the BL
Accessing ‘service’
Copy (post processed)
15b
86. 88
Explore or Imagine Our Data!
• CSV of Metadata
https://data.bl.uk/digbks/dig19cbooks-mdata-csv.csv
• 19th Century Books - Book Metadata - 01/09/2013.
https://data.bl.uk/digbks/db21.html
• Digitised Books - Flickr Tag History - Dec 2013 to March 2016.
TSV
https://data.bl.uk/digbks/db15.html
• Digitised Hebrew Manuscripts - Metadata
https://data.bl.uk/hebrewmanuscripts/heb1.html
• Digitised Hebrew Manuscripts: Or 2210 - Or 2364
https://data.bl.uk/hebrewmanuscripts/heb8.html
• Theatrical playbills from Britain and Ireland (OCR text only)
https://data.bl.uk/playbills/pb2.html
• Portraits of actors, views of theatres and playbills (covering
1750 - 1821 in a single volume)
https://data.bl.uk/singlesheet/por1.html
• Volumes of Lysons Collectanea (Amusements), comprising
broadsides, cuttings, advertisements on amusements.1660-
1840.
https://data.bl.uk/singlesheet/ad1.html
https://data.bl.uk
• Have a look at the data.
• Data Quality
• Issues
Or an idea you have thought of
what to do with the data!
http://labs.bl.uk/Ideas+for+Labs
Smaller datasets
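One quick way to "have a look at the data" and judge its quality is to measure how often each metadata column is empty. The sketch below does this for CSV text with Python's standard `csv` module. The column names and rows in the sample are invented for illustration and are not taken from the Library's metadata files; `missing_by_column` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def missing_by_column(csv_text):
    """Return {column: fraction of rows in which that column is empty}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    empty = Counter()
    total = 0
    for row in reader:
        total += 1
        for column in fieldnames:
            if not (row.get(column) or "").strip():
                empty[column] += 1
    return {c: (empty[c] / total if total else 0.0) for c in fieldnames}


# Invented sample: real column names will differ.
sample = """identifier,title,date,place
001,A Tour in Wales,1804,London
002,Poems,,
003,,1851,Dublin
004,Travels in Persia,1828,
"""

report = missing_by_column(sample)
for column, fraction in report.items():
    print(f"{column}: {fraction:.0%} empty")
```

For a downloaded file, the same function can be fed the file's contents read as text.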
25 Seconds (68 Words)
My name is Mahendra Mahey and I work on a project called British Library Labs. We are based at the British Library in London, in the Digital Scholarship department and we work closely with the Digital Research team there. It’s been running for three years now and is funded by the Andrew W. Mellon Foundation.
140 seconds
The British Library is the national library of the UK and one of the largest research libraries in the world. The Library moved to a new purpose-built building in 1997, <click> the largest of its kind that was built in the UK in the 20th century. Many frequently used items are stored 5 stories below the main building at St Pancras in London, and many might not know that part of the building is meant to look like a ship on a journey to discovery!<click>. <click to switch off>
The building can sit 1,200 researchers at any one time across 5 reading rooms.
<click>Medium and long term requested items are held at Boston Spa in Yorkshire in a low oxygen warehouse, using robots to retrieve items. In total, the library has 625 km of shelving, growing by 12 km every year.
Whilst we acquire items through purchase or gifts, much of the collection has been built up through legal deposit. That is, by law, a copy of every UK and Ireland print publication must be given to the British Library by its publishers. Around 3 million items are added per year. In 2013, legal deposit was extended to cover non-print material which means by law we take in digitally published items as well, which means regular mass crawls of the entire UK web domain as well as ebooks, ejournals etc.
85 seconds
The picture you can see is inside the main building in London. It’s the King’s Library, King George the Third’s personal library! Sometimes known as the ‘stack’, I walk past this every day and I sometimes forget that the collections the British Library have are truly staggering! We currently estimate them to exceed <click>150 million items, representing every age of written civilisation and every known language. Our archives now contain the earliest surviving dated printed book in the world, the Diamond Sutra, written in Chinese and dating from 868 AD….
So some big numbers…
Over …<click>14 million books
<click>60 million patents
<click>8 million stamps
<click>4 million maps
<click>3 million sound recordings
<click>1.6 million music scores
<click>over 0.3 million manuscripts
<click>0.8 million serial titles (which are of course made up of many, many volumes/editions). This is where a lot of our content is, just in case you thought the numbers didn’t add up!
33 Seconds (100 Words)
In a nutshell the project encourages researchers, artists, entrepreneurs, educators and anyone else,
<Click>
to ‘experiment’ with our digital collections and data. We are particularly interested in those who have questions which focus on the potential to find and create NEW things through access to the digital content. For example, asking a question across thousands of digitised books or newspapers is possible with computational techniques but would not be feasible using manual methods. Let’s look at a clear example.
<Click>
Get clearer annotation image and transcription (perhaps TILT)
6 Seconds (20 Words)
So <Click> ‘how’ do we try and engage those who might be interested in the BL’s digital collections and data? <Click>
17 Seconds (53 Words)
<Click>The British Library is one of the largest libraries in the world <Click> with an estimated 180 million physical items, of which only a small proportion has been digitised. <Click>We estimate this is around 1-2%, but no one really knows exactly how much. However, more and more items are being stored as ‘born digital’, such as the UK Web Archive<Click>
Have balance of Multimedia
Broadcast news and radio, sounds (Save our Sounds)
Books and newspapers
Images
BNB
Qatar Digital library
Hebrew manuscripts
21 Seconds (65 Words)
Katrina Navickas was particularly interested in the <Click>Chartist Movement, a group campaigning for the vote for working people. <Click>They were the biggest popular movement for democracy in 19th century British history, as this early picture of a huge monster meeting at Kennington Common shows. <Click>She wanted to use a combination of manual and computational methods to explore our Digitised Newspapers to find out when and where the Chartists met and plot the meetings on a map, <Click>hopefully unearthing new history.
970 files from a selection of 19th century newspaper titles from the BL corpus for us to correct using the overProof post-OCR correction software
The best way to measure the improvement made by the correction process is to compare the OCR'ed text and the automatically corrected text with a perfect correction made by a human (known as the "ground truth").
Hannah-Rose's 5 small human-corrected samples are shown as green dots. These are not only smaller than the other files, but their raw error rate is much lower at 13.3%. OverProof was measured as reducing this to 5.4%, a removal of almost 60% of errors.
The red dotted-line indicates the correction "break-even" point: the further under the line, the better the quality of the document after correction.
In the graph below, the grey line shows distribution of files across error rates before correction and the green line after correction.
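The comparison against a human "ground truth" can be made concrete with an edit-distance error rate, and the "almost 60%" figure follows from the two quoted rates: (13.3 − 5.4) / 13.3 ≈ 0.594. The sketch below assumes a character-level measure. The notes do not say which measure overProof used, so this illustrates the idea rather than reproducing their calculation, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_error_rate(ocr_text, ground_truth):
    """Edits needed to turn the OCR text into the ground truth, per ground-truth character."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)


def relative_reduction(before, after):
    """Share of errors removed when the error rate falls from before to after."""
    return (before - after) / before


# One long-s misreading in a five-letter word.
print(char_error_rate("Caufe", "Cause"))
# The two error rates quoted in the notes.
print(f"{relative_reduction(13.3, 5.4):.1%}")
```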
Posts small illustrations taken almost at random from the digitised book corpus to a Tumblr blog.
This experiment with undirected engagement was a by-product of work to uncover the hidden wealth of illustrations within the digitised pages.
50 seconds
Here is the anatomy of a Flickr record. Importantly, we have created links to many of the Library’s services, <click>so some of this lovely traffic is going back to the Library and hopefully generating more interest in our services, from downloading a pdf of the book to purchasing a high res scan of the image.
<click>Tags are added from the original book record, including the approximate page number the image came from. <click>Users of Flickr can add their own tags and, as I have mentioned, they have already started doing so.
18 Seconds (56 Words)
Indexing the BL 1 million & Mapping the Maps was led by James Heald in collaboration with others. <Click>They produced an index of 1 million 'Mechanical Curator collection' images on <Click>Wikimedia Commons from a collection of largely un-described images. <Click>This led to 50,000 maps being found within the collection, partly through a map-tag-a-thon. <Click>These are now being geo-referenced. <Click>
27 Seconds (82 Words)
Adam Crymble <Click>wanted to harness the power of playing fun games on arcade machines to help with crowdsourcing the tagging of un-described images. He particularly wanted to engage a younger audience in crowdsourcing. <Click>On the right you can see a replica 1980s arcade machine we built, <Click>and on the bottom left some tagging games that were developed through a ‘Game Jam’ for the machine. <Click>Let’s take a closer look at two of the games…<Click>
<click>The British Library faces many challenges of access to our Digital collections!
<click> Sometimes digital content is only available onsite due to license restrictions,
<click>or even only on a specific computer in a reading room! Technically there are very few reasons why digital content can’t be online
<click> though it might be too big or hasn’t been transferred from other digital storage media.
<click>Sometimes access is through a paywall. Finally,
<click>some content is in the happy sunny place, online, open and freely available.
The real reasons why there are challenges to accessing digital content are of course human. They require different approaches from the Library and may often involve an honest, open dialogue and negotiation with the publishers.
The Labs project has tried to address this problem by creating a ‘residency model’ for researchers to work intensively with a digital collection on-site, so as not to infringe access conditions. I will say more about this later.