Digital Berkshire, April 2012: Chris Clark, British Library PT#2
2. ● Scale & materiality
– Not individual, standard documents but vast collections of them; authenticity demands multiplicity of versions
● Cost
– Preservation not by individuals but by large organizations
● Intellectual Property
– If content is worth saving, someone is making money from it
4. “The challenge for libraries is to find ways to preserve
platform dependent digital works and to prevent the loss of
complex digital media…. Since we cannot possibly save
everything, we need to carefully consider which digital
materials are the most important to preserve and try to
anticipate the needs of future scholars and researchers”
Marlene Manoff, 2006
If preservation priority is X and user need is Y,
what are the values of X and Y?
5. If sustainability means that information is kept useful and available, then the LOCKSS
approach has real merit! It implies that SERVICES must be preserved as well.
6. Abundance of stored content:
attention is scarce & must be earned
Services
Content
7. platforms to focus in 2012+
Maintain active presence on
Continue to assess
8. Digital Research & Curator Team
[Diagram: the Digital Research & Curator Team at the centre. Labels recoverable from the extraction: digital scholars, academic networks, social media, commercial, digital consortia, funding bodies, machines, users, partnerships, systems & services. The remaining labels did not survive extraction.]
9. ● Extend
– Training & development: seminars, conferences, events, ‘Digital Conversations’ + ‘Tooling up’
● Collaborate
– Digital Scholarship: horizon scanning, Tech Watch, communities of practice, consortia
● Consolidate
– Digital Curation as collaborative process: acquisitions, workflows, tools, project management, funding, exhibitions & marketing
10. ● Europeana – SB Berlin
– Centenary of the outbreak of the First World War
– Will create a European corpus of digitised materials concerning the First World War in all its aspects
– Will contribute to Europeana a substantial collection of more than 400,000 outstanding sources
● User generated content
– Roadshows in 10 countries to create unique pan-European archive
– Preston event produced more than 2300 images from letters, diaries, medals, pictures, trench art, and more
11. ● British Newspaper Archive
– British Library and brightsolid online publishing
– Up to 40 million newspaper pages from the British Library's collection over 10 years
– Collection includes runs of most newspapers published in the UK since 1800
– Over 4m pages added since launch
● Google Books
– A 6 year project starting June 2011
– 250,000 Books, 1700-1870
– From the French Revolution to the end of slavery.
– Material in major European languages
– Focus on books that are not yet freely available in digital form online
– Access via Google Books and BL
– Storage at Google and BL
– Contract and terms available on the web!
12. ● Broadcast News
– TV & radio news receivable in the UK, since May 2010, e.g. Al-Jazeera English, CNN, France 24, Russia Today
– Search subtitles (where available)
– AHRC-funded project looking at speech-to-text technologies for opening up audio and video archives
– Project will index 3,000 hours of TV news and 3,000 hours of radio content
● IMPACT Historic Text
– Improve the digital accessibility of printed text produced before 1900
– OCR does not produce satisfactory results for old books, magazines and newspapers
– Historic material has archaic fonts, complex layouts, warped or degraded pages
– Manual post-correction is slow and expensive
13. Early music on-line: digitised 300 volumes (21k images) of rare early printed music from the British Library’s collections
Open educational licence encourages use and re-purposing of content and embedding in teaching and research
Detailed inventories of the books’ contents created for the first time, with access points for composer and title
Data included in British Library catalogue, COPAC and RISM music database, with links to digitised content
Digital images provided to Aruspix, which is developing an OCR and transcription tool for early music
www.earlymusiconline.org
14. ● Personal digital archives
– Data analysis beyond documents
– Use computer forensics
– Capture, management, description, and preservation of personal digital collections to facilitate access and analysis
– Archives range from poets (W Cope) and playwrights (H Pinter) to computer scientists (D Michie) and biologists
● Web archives
– Create a research collection of UK websites
– Develop high-impact data analytical access services
– Demonstrate the potential of domain level web archives, or the “haystacks”
– UK web domain > 9m .uk domain names
– Estimate 110TB/crawl
15. ● Goal
– Builds on previous crowd-sourcing projects, e.g. UK SoundMap
– Addressed key challenges – awareness, engagement, productivity at scale
● Approach
– Accessible and convenient application
– Immediate results and feedback
– Competitive tools
– Recognition and visible contribution
17. Results:
725 maps assigned spatial metadata over 5 days
Publicity minimal – social media key
~90 participants, top five completed half the work
Data quality good: <3% had errors
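The headline figures can be turned into rough per-day and per-participant rates. The following is a minimal sketch that uses only the numbers on the slide; it treats "~90" as exactly 90 and "half the work" as exactly 50%, both of which are simplifying assumptions.

```python
# Figures quoted on the slide; the approximations are treated as exact here.
maps_georeferenced = 725
days = 5
participants = 90           # slide says "~90"
top_contributors = 5
top_share = 0.5             # "top five completed half the work"
error_rate_upper = 0.03     # "<3% had errors", used as an upper bound

maps_per_day = maps_georeferenced / days
maps_by_top = maps_georeferenced * top_share
avg_per_top_contributor = maps_by_top / top_contributors
avg_per_other_participant = (maps_georeferenced - maps_by_top) / (
    participants - top_contributors
)
max_maps_with_errors = maps_georeferenced * error_rate_upper

print(f"Maps per day: {maps_per_day:.0f}")
print(f"Average per top-five contributor: {avg_per_top_contributor:.1f}")
print(f"Average per other participant: {avg_per_other_participant:.1f}")
print(f"Upper bound on maps with errors: {max_maps_with_errors:.0f}")
```

Under these assumptions a top-five contributor handled on the order of seventy maps and everyone else a handful each, which is the skew the slide points to.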
20. Evolution by projects and
commercial ties tends to
reduce interoperability and
inconveniences the
researcher
International collaborations,
such as International Image
Interoperability Framework,
seek a shared canvas
21. ARROW project - a tool to assist ‘diligent search’ and provide faster answers to:
Rights status? – Rightsholders? – Can I digitise?
Timeline: 2008 2009 2010 2011 2012 2013
● ARROW
– 29 Partners: Libraries, BIP, Reprographic Rights Organisation (UK)
– 12 countries (Austria, Denmark, France, Finland, Germany, Italy, the Netherlands, Norway, Slovenia, Spain, Sweden, UK)
– Pilots: Germany, France, Spain, UK
– Books only
● ARROW Plus
– 36 Partners
– 14 countries (Austria, Belgium, Bulgaria, Germany, Greece, Hungary, Ireland, Italy, Latvia, Lithuania, the Netherlands, Poland, Portugal, Spain)
– Books and images in books
22. ARROW benefits
Automated (where it can be – still some manual processes)
Therefore saves time and cost
ARROW search = 5 % of Manual search time
National partners working together across different sectors
Domain partners working together across countries
23. Persistent enquiry: can I use this?
Open Knowledge Foundation
Creative Commons Licenses
Persistent URLs
24. Six decades into the computer revolution,
four decades since the invention of the microprocessor,
and
two decades into the rise of the modern internet, all of
the technology required to transform industries through
software finally works and can be delivered at global
scale.
Marc Andreessen ‘Why software is eating the world’
Wall Street Journal August 20 2011
25. Our vision: In 2020, the British Library will be a leading
hub in the global information network, advancing
knowledge through our collections, expertise and
partnerships, for the benefit of the economy and society
and the enrichment of cultural life.
If Andreessen is right, we may not be talking in 2020
about digital libraries and digital curators but an agency
for the curation and creation of software.
More than 200 people poured into the Museum of Lancashire in Preston at the weekend to have their loved ones’ precious items digitised for the virtual archive (www.europeana1914-1918.eu/). Queuing began an hour before doors opened on Saturday (10.03.12), and the crowds continued to stream in until they closed nine hours later. Many people had travelled from as far afield as Leeds, Manchester, Birkenhead, Liverpool and Warrington just to be there. More than 2,300 images were taken of a wide variety of items, including letters, diaries, medals, birth and death certificates, nurses’ autograph books, cartoons, pictures and trench art (everyday objects made from anything the soldiers found, such as shell casings and spent ammunition).
The Preston roadshow is the latest in a series that is being rolled out across 10 countries in Europe this year to create a unique pan-European account of WW1 that is available to everyone. Europeana 1914-1918 brings together a partnership of libraries, museums, academic and cultural institutions, which in the UK includes the British Library, Oxford University, JISC and Lancashire County Council.
Commercial partners:
+ It gets the job done
- It can lock up information in silos while investment is recouped
The first works to be digitised by Google will range from feminist pamphlets about Queen Marie-Antoinette (1791), to the invention of the first combustion engine-driven submarine (1858), and an account of a stuffed Hippopotamus owned by the Prince of Orange (1775).
Luke McKernan (Lead Curator, Moving Image) and AHRC
Showing relationships between Target, Subject, Event, Special Collection. The Targets are colour coded to high level subject area and the size of the node represents the number of target instances.
The UK web domain:
– 9 million .uk domain names registered in December 2010
– ~1 million using other domain names
– Growing at 11% - 14% per year
– 40% estimated to be in scope for Legal Deposit
– Estimated ~110TB each UK domain crawl
Traditional “document-centric” approach does not scale up - canonical mission of heritage institutions being challenged
Many technical challenges – the constant need to respond to the evolving web:
– Harvests are at best snapshots or samples; cannot get everything: resource and legal constraints
– Crawler works well with HTML but struggles to capture advanced web content, e.g. rich media, dynamic and interactive content
– Rendering software does not always “replay” the archived content
– Cannot replay streaming media
Risks of “republishing” – libel, copyright
Legal Deposit offers some protection but access restricted to premises of LD institutions
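To get a feel for what 11% - 14% annual growth means for the size of the .uk domain, the figures quoted above can be compounded forward. This is a rough sketch only: it assumes the growth rate stays constant year on year, which the notes do not claim, and the function name `project_domains` is made up for illustration. It does not project crawl size, because the notes give no relationship between the number of domains and terabytes harvested.

```python
def project_domains(start_millions, annual_growth, years):
    """Compound growth: start * (1 + rate) ** years (constant rate assumed)."""
    return start_millions * (1 + annual_growth) ** years


start_millions = 9.0        # .uk names registered, December 2010 (from the notes)
in_scope_share = 0.40       # share estimated to be in scope for Legal Deposit

in_scope_millions = start_millions * in_scope_share
print(f"In scope in 2010: about {in_scope_millions:.1f} million domains")

for rate in (0.11, 0.14):   # lower and upper growth estimates from the notes
    for years in (1, 5, 10):
        projected = project_domains(start_millions, rate, years)
        print(f"{rate:.0%} growth, {years:>2} years: {projected:.1f} million domains")
```

Even at the lower rate the domain count roughly doubles within seven years, which is one way to read the remark that a document-centric approach does not scale.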
This work was only continued in England when the threat of attack by France loomed, with the start of the Napoleonic Wars in 1795. It started on a scale of six inches to the mile for the south coast, which was later reduced. Thenceforth all mapping was to be done by the Board of Ordnance. It was done by 1815. The original 1-inch maps of England were based upon this. I’ve taken a detail of Exeter.
These geo-tools allow historic maps to be overlaid and combined with modern mapping, enhancing the ability to compare and analyse the representation, and enabling searching by placename.
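One common way to make such an overlay possible is to match a few points on the scanned map to known coordinates and fit a transform between them. The sketch below fits an affine transform from exactly three control points in plain Python. It is an illustration of the general idea, not a description of the tool used in the project; the function names and the control-point coordinates are invented for the example, and real georeferencing tools typically use more control points and more flexible transforms.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3(matrix, rhs):
    """Solve a 3x3 linear system with Cramer's rule."""
    d = det3(matrix)
    if abs(d) < 1e-12:
        raise ValueError("control points are collinear; pick a different set")
    solution = []
    for col in range(3):
        # Replace one column with the right-hand side, as Cramer's rule requires.
        replaced = [
            row[:col] + [rhs[r]] + row[col + 1:] for r, row in enumerate(matrix)
        ]
        solution.append(det3(replaced) / d)
    return solution


def fit_affine(control_points):
    """Return a pixel -> (lon, lat) function from three ((x, y), (lon, lat)) pairs."""
    matrix = [[float(x), float(y), 1.0] for (x, y), _ in control_points]
    a, b, c = solve3(matrix, [lon for _, (lon, _) in control_points])
    d, e, f = solve3(matrix, [lat for _, (_, lat) in control_points])

    def to_lonlat(x, y):
        return (a * x + b * y + c, d * x + e * y + f)

    return to_lonlat


# Invented control points: pixel position on the scan -> (longitude, latitude).
example_points = [
    ((0, 0), (-3.60, 50.80)),
    ((1000, 0), (-3.40, 50.80)),
    ((0, 1000), (-3.60, 50.65)),
]
to_lonlat = fit_affine(example_points)
print(to_lonlat(500, 500))
```

Once every pixel has a longitude and latitude, the scan can be drawn over a modern base map, and a placename search reduces to a coordinate lookup.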